from h5glance import H5Glance # Browsing HDF5 files
H5Glance("data.h5")
import h5py # Pythonic HDF5 wrapper: https://docs.h5py.org/
h5file = h5py.File("data.h5", mode="r") # Open HDF5 file in read mode
data = h5file["/data"][()] # Access HDF5 dataset "/data"
%matplotlib inline
from matplotlib import pyplot as plt
plt.imshow(data, cmap="gray")
data = h5file["/compressed_data"][()] # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 data = h5file["/compressed_data"][()]

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/venvs/py310/lib/python3.10/site-packages/h5py/_hl/dataset.py:758, in Dataset.__getitem__(self, args, new_dtype)
    756 if self._fast_read_ok and (new_dtype is None):
    757     try:
--> 758         return self._fast_reader.read(args)
    759     except TypeError:
    760         pass  # Fall back to Python read pathway below

File h5py/_selector.pyx:376, in h5py._selector.Reader.read()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
hdf5plugin usage

To enable reading compressed datasets not supported by libHDF5 and h5py, install hdf5plugin and import it:
%%bash
pip3 install hdf5plugin
# Or:
conda install -c conda-forge hdf5plugin
Or, on Debian 12 and Ubuntu 23.04:
%%bash
apt-get install python3-hdf5plugin
import hdf5plugin
data = h5file["/compressed_data"][()] # Access the compressed dataset
plt.imshow(data, cmap="gray") # Display data
h5file.close() # Close the HDF5 file
When writing datasets with h5py, compression can be specified with h5py.Group.create_dataset:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
# Create a compressed dataset
h5file = h5py.File("new_file_blosc2_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    compression=32026,  # Blosc2 HDF5 filter identifier
    # compression_opts: (0, 0, 0, 0, level, filter, compression);
    # here: compression level 5, bitshuffle filter (2), lz4 algorithm (1)
    compression_opts=(0, 0, 0, 0, 5, 2, 1)
)
h5file.close()
hdf5plugin provides some helpers to ease dealing with compression filters and their options:
h5file = h5py.File("new_file_blosc2_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    compression=hdf5plugin.Blosc2(
        cname='lz4',
        clevel=5,
        filters=hdf5plugin.Blosc2.BITSHUFFLE,
    ),
)
h5file.close()
help(hdf5plugin.Blosc2)
H5Glance("new_file_blosc2_bitshuffle_lz4.h5")
h5file = h5py.File("new_file_blosc2_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data"][()], cmap="gray")
h5file.close()
!ls -sh new_file*.h5
18M new_file_blosc2_bitshuffle_lz4.h5 19M new_file_uncompressed.h5
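The compression ratio can also be inspected from Python; a small sketch using the dataset's nbytes and its low-level on-disk storage size (h5py's Dataset.id.get_storage_size()), reusing the file written above:
h5file = h5py.File("new_file_blosc2_bitshuffle_lz4.h5", mode="r")
dataset = h5file["/compressed_data"]
# Uncompressed size in memory vs. compressed size on disk
ratio = dataset.nbytes / dataset.id.get_storage_size()
print(f"Compression ratio: {ratio:.2f}")
h5file.close()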
h5py

- Pre-compression filter: Byte-Shuffle, provided by libhdf5
- Compression filters provided by:
  - libhdf5: "gzip" and optionally "szip"
  - h5py: "lzf"

h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
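For completeness, a minimal sketch of h5py's built-in "lzf" compression (the file name is an arbitrary choice):
# Create a dataset compressed with the "lzf" filter shipped with h5py
h5file = h5py.File("new_file_lzf.h5", mode="w")
h5file.create_dataset("/compressed_data_lzf", data=data, compression="lzf")
h5file.close()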
hdf5plugin

Additional compression filters provided by hdf5plugin: BitShuffle, Blosc, Blosc2, BZip2, FciDecomp, LZ4, SZ, SZ3, ZFP, Zstandard.
That is 10 out of the 29 HDF5 registered filter plugins as of September 2023.
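This list can also be retrieved programmatically; a sketch assuming hdf5plugin's get_filters() helper and the filter classes' filter_name and filter_id attributes (available in recent hdf5plugin versions):
# Print each hdf5plugin filter with its registered HDF5 filter ID
for filter_class in hdf5plugin.get_filters():
    print(filter_class.filter_name, filter_class.filter_id)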
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_bitshuffle_lz4",
    data=data,
    compression=hdf5plugin.Bitshuffle(),
)
h5file.close()
Notes on some of these filters:
- FciDecomp: supports (u)int8 or (u)int16 data
- SZ compressor family: error-bounded lossy compression of float32, float64, (u)int32 and (u)int64 data
- Blosc compression algorithms: blosclz, lz4, lz4hc, snappy (optional, requires C++11), zlib, zstd
- Blosc2 compression algorithms: blosclz, lz4, lz4hc, zlib, zstd
- Blosc2 codec plugins: zfp, htj2k
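As an illustration of the error-bounded lossy SZ filter, a minimal sketch (the data, error bound and file name are arbitrary choices for this example):
import numpy as np
# float32 is one of the types supported by SZ
values = np.linspace(0, 1, 512 * 512, dtype=np.float32).reshape(512, 512)
h5file = h5py.File("new_file_sz_absolute.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_sz",
    data=values,
    # Absolute error bound: each decompressed value differs from the
    # original by at most 0.001 (arbitrary bound for this sketch)
    compression=hdf5plugin.SZ(absolute=0.001),
)
h5file.close()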
Blosc and Blosc2 include some pre-compression filters and compression algorithms provided by other HDF5 compression filters:
- Byte-shuffle => Blosc2(..., filters=Blosc2.SHUFFLE)
- Bitshuffle() => Blosc2("lz4" or "zstd", 5, Blosc2.BITSHUFFLE)
- LZ4() => Blosc2("lz4", 9)
- Zstd() => Blosc2("zstd", 2)
The Blosc2 filter could also provide ZFP (not yet included in hdf5plugin).
Benchmark setup:
- Machine: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (40 cores)
- Filesystem: /dev/shm
- hdf5plugin built from source
- Running on a single core
- Diffraction tomography dataset: 100 frames from http://www.silx.org/pub/pyFAI/pyFAI_UM_2020/data_ID13/kevlar.h5
- Dataset: 100x2167x2070, uint16, chunk: 2167x2070
Some filters can use multithreading, controlled with environment variables:
- BLOSC_NTHREADS for the Blosc and Blosc2 filters
- OMP_NUM_THREADS for OpenMP-based filters
Performance does not increase linearly with the number of CPU cores used.
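A minimal sketch setting these variables from Python, before the filters are first used (the thread count is an arbitrary choice):
import os
os.environ["BLOSC_NTHREADS"] = "4"   # threads used by the Blosc/Blosc2 filters
os.environ["OMP_NUM_THREADS"] = "4"  # threads used by OpenMP-based filters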
Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed, compression ratio and, for lossy filters, error rate.
Availability and compatibility are also worth keeping in mind: since "gzip" is included in libHDF5, it is the most compatible filter ("lzf" as well, since it is included in h5py).
hdf5plugin filters with other applications

To use hdf5plugin filters with other applications, set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH:
%%bash
export HDF5_PLUGIN_PATH=$(python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)")
echo "HDF5_PLUGIN_PATH=${HDF5_PLUGIN_PATH}"
ls ${HDF5_PLUGIN_PATH}
Note: Only works for reading compressed datasets, not for writing!
hdf5plugin license

The source code of hdf5plugin itself is licensed under the MIT license.
It also embeds the source code of the provided compression filters and libraries, which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib...) and copyrights.
hdf5plugin relies on a "hack" to avoid linking with libhdf5.

Compression filters allocate a memory buffer to store decompressed data, which implies memory copies.
Allowing the user to pass a memory buffer through h5py -> libhdf5 -> compression filter would avoid this.
An example with h5py and Blosc2 (bitshuffle+lz4) for an 8.5 MB chunk on 1 core (± ~300 µs):
- dataset[()]: 8.9 ms
- read_direct() to an existing array: 5.2 ms
- read_direct_chunk() & decompression with blosc2: 3.7 ms
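A minimal sketch of the read_direct() variant from the list above, decompressing into a pre-allocated array (it reuses the file written earlier):
import numpy as np
h5file = h5py.File("new_file_blosc2_bitshuffle_lz4.h5", mode="r")
dataset = h5file["/compressed_data"]
out = np.empty(dataset.shape, dtype=dataset.dtype)  # pre-allocated destination
dataset.read_direct(out)  # decompress directly into `out`, no extra allocation
h5file.close()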
Thanks to hdf5plugin contributors: Armando Sole, @orioltinto, @mkitti, @Florian-toll, Jerome Kieffer, @fpwg, @mobiusklein, @junyuewang, @Anthchirp, and to all contributors of embedded libraries.
Partially funded by the LEAPS-INNOV and PaNOSC EU projects.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements no. 101004728 and 823852.