Below is a copy of the notes taken collaboratively during the workshop.
Sep 19, morning session
HDF5 and plugins - overview and roadmap - Dana Robinson
- consider establishing benchmarks with HDF5
- decide on datasets and scores (e.g. https://en.wikipedia.org/wiki/Weissman_score – but see the “Limitations” section and the Talk-page comments: this particular score was “made for TV” and is unit-dependent, since log(1) gives 0; see the sketch after this list)
- Argonne has website to drop data sample, get back report
- ??? (please paste URL)
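For reference, a minimal Python sketch of the Weissman score as defined on the Wikipedia page above, illustrating the unit dependence; the constant alpha and the reference-compressor numbers are made up for illustration:

```python
# Sketch of the Weissman score W = alpha * (r / r_ref) * (log(t_ref) / log(t));
# r = compression ratio, t = compression time, r_ref/t_ref = reference compressor.
import math

def weissman(r, t, r_ref, t_ref, alpha=1.0):
    return alpha * (r / r_ref) * (math.log(t_ref) / math.log(t))

# The same measurement expressed in seconds vs. in minutes:
print(weissman(r=4.0, t=120.0, r_ref=2.0, t_ref=60.0))  # times in seconds
print(weissman(r=4.0, t=2.0,   r_ref=2.0, t_ref=1.0))   # times in minutes -> log(1) = 0, score collapses to 0
```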
- how about encrypting HDF5 files with filters?
- encryption filters can be written; however, they would only encrypt the payload (the HDF5 datasets), not the metadata
- github repos out there to encrypt
- ??? (please paste URL)
Expanding HDF5 capabilities to support multi-threading access and new types of storage - Elena Pourmal
- are sparse arrays being considered for support in HDF5?
- yes, this is being considered
- in the future: can the file size be reduced without rewriting the complete file (i.e. with sparse arrays)?
- not sure this is possible; e.g. removing datasets creates holes in the file storage
- best would be to rewrite the file (see the sketch below)
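A minimal h5py sketch of why deleting a dataset does not shrink the file and why a rewrite (e.g. with h5repack) is needed to reclaim the space; file and dataset names are placeholders:

```python
# Sketch: deleting a dataset only removes the link, the space is not reclaimed.
import os
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("big", data=np.zeros((1000, 1000), dtype="f8"))
size_before = os.path.getsize("demo.h5")

with h5py.File("demo.h5", "a") as f:
    del f["big"]                      # unlinks the dataset, leaves a hole in the file
size_after = os.path.getsize("demo.h5")

print(size_before, size_after)        # roughly the same size
# Reclaiming the space requires rewriting, e.g.:  h5repack demo.h5 demo_repacked.h5
```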
- comment: having multi-threaded HDF5 or netCDF library would be super great
- helpful for many applications (also considering parallel i/o)
- Lifeboat plans on having a release for the community to test in Spring 2024
- question on page 12, “Results HDF5 VOL Connector Prototype”: how should the metadata reading results be understood?
- an optimisation was done to bring all file metadata into memory
- this refers to HDF5 metadata (not experimental metadata)
Recent improvements in the HDF5/Blosc2 plugin systems - Francesc Alted
blosc2 plugin mentioned: https://github.com/Blosc/blosc2_openhtj2k
- depends on https://github.com/osamu620/OpenHTJ2K
does blosc2 guarantee that the filters are future-proof?
- the Blosc team tries to be future-proof (a high priority by their standards)
- see the blog post from 2018-02-21 on the forward-compatibility policy, written after a forward-compatibility issue some years ago: https://www.blosc.org/posts/new-forward-compat-policy/
- if something is introduced that breaks forward/backward compatibility, the team responds quickly to revert such a change
how do you verify whether a new plugin (like openhtj2k) is thread-safe?
- multiple approaches to add multi-threading
- openhtj2k is not thread-safe, so it cannot be used multi-threaded right now
- for small block sizes, compression ratios suffer -> no benefit from multi-threading
- the original author is working on thread-safety; it is already exposed in the blosc2 plugin metadata
what is the policy to decide which codecs go into blosc2 versus which become a plugin?
- intention is to avoid code bloat of core blosc2
- advertise plugins more, as they are decoupled from blosc2 core development
- this way, community has more ownership of development
- hope: more people can use plugins
re btune: how does it choose the best codec?
- btune exposes a parameter that balances speed versus compression ratio, which helps control the trade-off
how does btune account for the effect of different hardware on compression speed?
- tricky topic, at this moment we are not dealing with this effect
- compression behaviour can differ (local laptop versus btune server)
- use a testing platform similar to the one that will be used in practice (see the sketch below for the underlying speed-vs-ratio trade-off)
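For context, a stdlib-only sketch of the speed-versus-ratio trade-off that the btune parameter controls; it simply sweeps zlib levels by hand and is not how btune itself works:

```python
# Sketch: measure compression ratio vs. time for different zlib levels.
import time
import zlib
import numpy as np

data = np.random.default_rng(0).integers(0, 16, size=2_000_000, dtype=np.uint8).tobytes()

for level in (1, 5, 9):
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    ratio = len(data) / len(compressed)
    print(f"level={level}  ratio={ratio:.2f}  time={dt * 1e3:.1f} ms")
```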
another question was about the filters: does Blosc guarantee that the standard filters will always keep working in the future, i.e. can Blosc always read a file in the future if one of the standard filters was used to compress the original file?
- totally. one of the reasons for including the sources for the standard codecs/filters is to have control over forward compatibility. we are extremely committed to that goal
- a different case would be a dynamic filter maintained by an external user/group; we cannot make any guarantees there
GPU processing of HDF5 data - Jerome Kieffer
nvcomp: https://developer.nvidia.com/nvcomp
opencl lz4: https://www.silx.org/doc/silx/dev/Tutorials/codec/Bitshuffle-LZ4.html
hdf5plugin currently switches off OpenMP: when using numpy in parallel, I would like to use OpenMP nonetheless.
- numpy will not use OpenMP by default
- linalg libs however would use OpenMP by default
hint: a great tool for profiling Python is https://github.com/P403n1x87/austin/ (sampling-based, interfaces with flamegraph and more)
Processing HDF5 data with FPGAs - Zdenek Matej
- FPGAs are traditionally not used in photon-science data processing
- Zdenek got interested when he saw the potential after a few days hacking
- issues with installation and programming have become much easier and are not a blocking point anymore
- implemented azimuthal integration on FPGA
- BSLZ4 compression on FPGA is still WIP
Sep 19, afternoon session
NetCDF Compression Improvements - Edward Hartnett
- quantization, i.e. limiting the precision by setting trailing significand bits to zero, works well together with zlib (see the sketch after this list)
- netCDF has a community project for data compression plugins, the CCR (Community Codec Repository?)
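A numpy sketch of this kind of bit-rounding quantization (zeroing the lowest float32 mantissa bits before lossless compression); the number of dropped bits and the zlib comparison are illustrative and this is not the exact netCDF algorithm:

```python
# Sketch: zero the lowest mantissa bits of float32 data, then compare zlib sizes.
import zlib
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000_000).astype(np.float32)

def drop_mantissa_bits(a, bits_to_drop):
    """Zero the lowest `bits_to_drop` bits of the float32 mantissa (lossy)."""
    ints = a.view(np.uint32)
    mask = np.uint32(0xFFFFFFFF) << np.uint32(bits_to_drop)
    return (ints & mask).view(np.float32)

quantized = drop_mantissa_bits(data, bits_to_drop=16)   # keep ~7 of 23 mantissa bits

print("original  :", len(zlib.compress(data.tobytes(), 6)))
print("quantized :", len(zlib.compress(quantized.tobytes(), 6)))
print("max abs error:", np.max(np.abs(data - quantized)))
```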
hdf5plugin: Use HDF5 compression filters from Python - Thomas Vincent
- generic solution for installing compression filters
- discussion topic for tomorrow: where should hdf5plugin live, i.e. in h5py or in libhdf5? (a usage sketch follows below)
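Typical hdf5plugin usage with h5py, as a minimal sketch (dataset name, chunking, and the choice of the Bitshuffle filter are placeholders):

```python
# Sketch: write a compressed dataset using a filter provided by hdf5plugin.
import numpy as np
import h5py
import hdf5plugin   # importing registers the bundled HDF5 compression filters

data = np.random.default_rng(0).integers(0, 1000, size=(256, 256), dtype=np.uint16)

with h5py.File("compressed.h5", "w") as f:
    # The filter object expands to the compression/compression_opts keyword arguments.
    f.create_dataset("image", data=data, chunks=(64, 64), **hdf5plugin.Bitshuffle())

with h5py.File("compressed.h5", "r") as f:
    restored = f["image"][...]    # reading works as long as hdf5plugin is imported
```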
Current and upcoming challenges for data packaging of DECTRIS X-ray detectors - Max Burian et al
- implemented a FileWriter (HDF5 writer) to enable parallel file writing
- limited use of virtual datasets, because full use poses problems for processing (see the generic VDS sketch after this list)
- support the NXmx Nexus standard
- virtual datasets do not automatically solve parallel writing issues
- raw data rates will increase >5 up to 200 Gbps of compressed data
- available bandwidth will always be used
- compression algorithms need to be improved continuously for each architecture
- need to save only the data which are scientifically useful
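A generic h5py virtual dataset (VDS) sketch for reference; it only illustrates how frames spread over several files can be exposed as one dataset, and is not the DECTRIS FileWriter. File names and shapes are placeholders.

```python
# Sketch: stitch per-writer files into a single virtual dataset.
import numpy as np
import h5py

n_files, frames_per_file, shape = 4, 100, (512, 1028)

# Assume each writer produced part_0.h5 .. part_3.h5 with a dataset "data".
for i in range(n_files):
    with h5py.File(f"part_{i}.h5", "w") as f:
        f.create_dataset("data", data=np.zeros((frames_per_file, *shape), dtype="u2"))

layout = h5py.VirtualLayout(shape=(n_files * frames_per_file, *shape), dtype="u2")
for i in range(n_files):
    source = h5py.VirtualSource(f"part_{i}.h5", "data", shape=(frames_per_file, *shape))
    layout[i * frames_per_file:(i + 1) * frames_per_file] = source

with h5py.File("master.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```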
Compression Plugins in h5wasm - Brian Maranville
- good explanation on what h5wasm is and how it is generated with emscripten
- compression filters zstd + lz4 work once the right symbols are exported
- possible now to build more plugins, need help from experts in cmake, emscripten, …
Sep 20, morning session
h5cpp and pninexus c++ libraries - Jan Kotanski
- a generic C++ API for HDF5 and Nexus
- pninexus provides an xml builder for generating files following the Nexus structure
- I/O of NeXus files: how do you handle writing strings? (in HDF5 there are multiple ways to write strings; see the sketch after this block) How do you handle cases where people try to read NeXus files which were written outside of your ecosystem?
- yes, we try to support them
- additional traits can be provided if string formats mismatch
- sometimes individual handlers need to be provided
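To illustrate the point about multiple string representations in HDF5, a minimal h5py sketch of variable-length versus fixed-length string datasets (this shows the generic HDF5 concept, not the h5cpp/pninexus API):

```python
# Sketch: two common ways of storing strings in HDF5.
import h5py

with h5py.File("strings.h5", "w") as f:
    # Variable-length UTF-8 strings
    vlen = h5py.string_dtype(encoding="utf-8")
    f.create_dataset("vlen_strings", data=["alpha", "beta", "gamma"], dtype=vlen)

    # Fixed-length ASCII strings (padded to 16 bytes)
    fixed = h5py.string_dtype(encoding="ascii", length=16)
    f.create_dataset("fixed_strings", data=[b"alpha", b"beta", b"gamma"], dtype=fixed)
```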
- what C++ library do you use for XML handling (read/write)?
- use boost xml parser for now
- can STL containers also be used for writing/reading data?
- yes, adapters are provided and are handled by templates inside h5cpp/pninexus
- how does this compare to other c++ bindings of hdf5 libs?
- not sure and never compared
- Elena: all c++ libs/wrappers are different, coverage of hdf5 API is not comprehensive
- Jan: limited to parts of API that are useful for us
openPMD - the open Standard for Particle Mesh Data - Franz Pöschel
a metadata standard and ecosystem which supports HDF5, ADIOS, and JSON as backing file formats
reference implementation is openPMD-api as an open stack for scientific I/O
HELPMI project will enable interoperability between Nexus and openPMD
I/O performance is no longer scaling on HPC systems, similar to the situation with camera data rates
are most files compressed, or all of them?
- compression only available for ADIOS2 files
- chunking was recently implemented for HDF5 (with compression following from that)
are mass and charge set as attributes? (page 6/33)
- charge and mass are arrays, but can be replaced with attributes if the value is constant across all particles of the same species (ditto for a mesh, e.g. constant temperature)
- good idea: use attributes for constant properties, datasets for non-constant ones (PS: not sure I captured this right; a generic sketch follows below)
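A generic h5py sketch of that pattern (constant per-species values as attributes, per-particle values as datasets); this illustrates the idea only and is not the actual openPMD layout; all names and values are placeholders:

```python
# Sketch: constants as attributes, per-particle quantities as datasets.
import numpy as np
import h5py

n_particles = 10_000
positions = np.random.default_rng(0).normal(size=(n_particles, 3))

with h5py.File("particles.h5", "w") as f:
    grp = f.create_group("electrons")
    grp.attrs["charge"] = -1.602176634e-19   # constant across the species -> attribute
    grp.attrs["mass"] = 9.1093837015e-31     # constant across the species -> attribute
    grp.create_dataset("position", data=positions)   # varies per particle -> dataset
```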
Is OpenPMD specific to geometries, i.e. can you add custom properties of particles?
- standard allows extensions, predefines a small set of common geometries
NexusCreator & iCAT - Helmholtz-Zentrum Berlin applying FAIR data management - Hector Perez Ponce
BESSY beamlines store data in different ways, but work is in progress to standardise on NeXus and register the data in ICAT
NeXus is an ontology, cf. https://nexusformat.org; nxstruct is a structure for defining the layout
using CoffeeScript (Python-like syntax that compiles to JavaScript) and h5wasm to produce NeXus/HDF5 files
supports NOMAD to NeXus/HDF5 conversion, Kafka to Nexus/HDF5, Bluesky to Nexus/HDF5
example on U41-PEAUX RIXS beamline at BESSY
ongoing work is
is there a web application for NexusCreator?
is NexusCreator effectively written in JavaScript?
- yes, due to my background
- is there a web application?
- not at the moment, this is likely planned
for reproducibility: can you go back from nexus to the original?
- idea: in the future processing software will ingest nexus directly
- for now, the raw data is stored in iCAT to be able to go back to it
at DESY: discussion to extract metadata from iCat; how do you do that at HZB?
- iCat: ???
converting a generic NeXus file to an application definition: what if not all data is available?
- we don’t have everything in fact
- we are currently FAIR-ready here, so for some parts of the data we are FAIR, for some we are not
- if required data is missing, it is not added post hoc
speaker: if people would like access to the NexusCreator repo, please send Hector Perez Ponce a message
do you support encryption?
- currently implementing compression
- encryption not available
Is there some validation when data is stored?
- people provide raw data, then we chat, explain the NeXus standard, and write the file “manually” together
- plan to automate this? (possibly similar beamlines could be combined)
- yes, considering it
- in the future would be great
comment: in the NOMAD environment, the NeXus file is validated and it is checked whether the file content matches
Using Sparse Arrays for Synchrotron 3D-XRD-CT Data Reduction - Jonathan Wright
tool for sample based profiling: https://github.com/benfred/py-spy
how to organize data processing, i.e. the pipelining?
- on slurm cluster: jupyter notebooks (users can connect through a browser with this)
- jupyter notebooks process dataset from memory (requires large enough node)
- users can only touch data after experimental run is finished
- compression slows things down; one could say: just don’t do compression then
- crucial: reliable code when experiment is running -> processing can run later
Q: pinning processing to numa nodes came up multiple times now. Just wondering, SLURM can be configured to be NUMA aware, i.e. it would place and pin processes in jobs using cgroups, wouldn’t that be a solution?
- taskset and pinning doesn’t solve all problems
take data from the EIGER directly and process it on the fly?
- would be possible
- disk storage is the limiting factor
- could be possible as a LIMA plugin -> a question of human resources
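Since the talk is about sparse arrays for data reduction, here is a minimal numpy/h5py sketch of storing only the above-threshold pixels of a detector frame; threshold, names, and layout are illustrative and not the speaker's actual pipeline:

```python
# Sketch: keep only above-threshold pixels of a frame (COO-style sparse storage).
import numpy as np
import h5py

rng = np.random.default_rng(0)
frame = rng.poisson(0.05, size=(2048, 2048)).astype(np.uint32)   # mostly zeros

threshold = 0
rows, cols = np.nonzero(frame > threshold)
values = frame[rows, cols]

with h5py.File("sparse_frame.h5", "w") as f:
    f.attrs["frame_shape"] = frame.shape
    f.create_dataset("row", data=rows.astype(np.uint16), compression="gzip")
    f.create_dataset("col", data=cols.astype(np.uint16), compression="gzip")
    f.create_dataset("value", data=values, compression="gzip")

# Reconstruction:
with h5py.File("sparse_frame.h5", "r") as f:
    restored = np.zeros(tuple(f.attrs["frame_shape"]), dtype=np.uint32)
    restored[f["row"][...], f["col"][...]] = f["value"][...]

assert np.array_equal(frame, restored)
```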
Data Reduction in serial crystallography - Marina Galchenkova
lossless compression does not provide enough compression for SC data
using lossy compression means it is crucial to check the loss in quality of results, resolution is key
reprocessing raw data allows new pipelines to reach higher resolution
saving only the peaks will result in worse resolution
the easiest solution is to save a few significant bits!
need to store more low-intensity peaks; save an 8-bit floating-point-like representation
binning + 3 significant bits gives a compression ratio of up to 36.3
work has been presented at a few conferences + is implemented at P11 (DESY) and applied to ESRF+APS data
conclusion/question: given the concern about data reduction, can you choose the way the data is reduced during the experiment?
- yes and no
- yes: based on existing pipeline, check data quality pipelines
- no: methods like binning
on the binning slide, quality scores get better with compressed data; maybe the compression removed noise (when using the EIGER 16M)
- whether it is the right or wrong detector is tricky to answer
- a lot of details depend on external factors (beam quality) and experiment design (geometry)
- aim: process data and resolve peaks at highest resolution possible, remain flexible for different types of samples
what does binning mean?
- take 2x2 pixel blocks and sum them up (see the sketch below)
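A minimal numpy sketch of 2x2 binning (the frame contents are placeholders):

```python
# Sketch: 2x2 binning by summing each 2x2 pixel block.
import numpy as np

frame = np.arange(16, dtype=np.uint32).reshape(4, 4)

h, w = frame.shape
binned = frame.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

print(binned.shape)   # (2, 2); each output pixel is the sum of a 2x2 block
```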
how to accumulate the frames?
- [not sure I understood] frame by frame accumulation is used, not a 3D volume as in https://doi.org/10.1107/S2052252514022313
Lossless and Lossy Compression for Photon Science - Peter Steinbach
presented summary results from across the photon sources
3 techniques need a much higher compression ratio, viz. SX, XPCS, and laser shots
signal processing before compression can become key, e.g. binning, quantisation
btune helped improve the parameters for lossless compression, reaching a compression ratio of 3–4
LEAPS-INNOV tried to find the best parameters for compression, i.e. how to interpret MSSIM (see the sketch after this block)
neural compressors impose new challenges because the model, the sampling distribution, and the encoding all need to be stored
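For the MSSIM point, a minimal scikit-image sketch of computing the structural similarity between an original frame and a lossy reconstruction; the synthetic data and the use of skimage's structural_similarity are assumptions about the metric meant here (a mean SSIM would average such values over frames):

```python
# Sketch: SSIM between an original frame and its lossy reconstruction.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.normal(size=(512, 512)).astype(np.float32)

# Stand-in for lossy compression/decompression: quantize to 256 levels.
levels = 256
lo, hi = original.min(), original.max()
reconstructed = np.round((original - lo) / (hi - lo) * (levels - 1)) / (levels - 1) * (hi - lo) + lo

score = structural_similarity(original, reconstructed, data_range=float(hi - lo))
print(f"SSIM = {score:.4f}")   # 1.0 means identical images
```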
did you notice a change in scientists’ acceptance of lossy compression over the last few years?
- not really, but increasing costs for storage and energy will eventually impose limits which will force scientists to accept lossy compression
Sep 20, afternoon session
HDF Compression for data service architectures - John Readey
- the cloud has become a platform that solves some of the issues of the HDF5 library
- bandwidth performance comes from having many requests simultaneously
- can scale by adding more nodes
- in contrast to the regular HDF5 library, HSDS can compress variable-length data (a minimal h5pyd access sketch follows below)
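For reference, a minimal sketch of accessing HSDS from Python via h5pyd, which mirrors the h5py API; the domain path, endpoint, and dataset name are placeholders and assume a running HSDS instance with credentials configured:

```python
# Sketch: read a dataset from an HSDS server through h5pyd (h5py-like API).
import h5pyd

# Domain path and endpoint are placeholders for an actual HSDS deployment.
with h5pyd.File("/shared/example/data.h5", "r",
                endpoint="http://hsds.example.org:5101") as f:
    dset = f["image"]
    block = dset[0:10, 0:10]   # the server handles chunk fetching/decompression
    print(dset.shape, block.mean())
```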
Discussion
Where to put common filters?
List the use cases
- Long term usability (10 years)
What tooling is required?
- How does a data consumer get hold of missing add-ins?
Improve the error messages when a filter is not available
Which kind of HDF5 Metadata can be exposed to help in the overall Data FAIRisation?
What governance, where, and who?
- Science
- Policy
- Certification
- Change management
- Technical
Availability of plugins in 10 yrs
- Distribution of artifacts (e.g., Francesc’s Python wheels, .NET NuGet)
Mechanism for a few key filters to have similar support to GZip? (Promotion of some filters)
Participants
- Data producers
- Data consumers
- Filter developers
- Plugin developers
- Integrators (e.g., h5py), 3rd party developers (e.g., MathWorks)
What is a filter vs. a plugin?
- Provide skeleton(s)
Survey of filter popularity?
Testing
- Regression
- Performance
- “Exotic hardware”
Committee suggestions
- Dana R
- Elena P
- Peter Steinbach
- Thomas Vincent
- Commercial? DECTRIS?
- MATLAB
- Francesc Alted
Funding?
- EU open calls next year - EOSC
- Zuckerberg / Amazon? This fall
- NSF
Miscellaneous complaints
- Iterators
- Plugins don’t work directly
- Delta filter structure?
- Filter memory overhaul