Below is a copy of the notes taken collaboratively during the workshop.
Sep 19, morning session
HDF5 and plugins - overview and roadmap - Dana Robinson
- consider establishing benchmarks with HDF5
- decide on datasets and scores (e.g. https://en.wikipedia.org/wiki/Weissman_score – but see the “Limitations” section and the Talk-page comments: this particular score was “made for TV” and is unit-dependent, since log(1) gives 0; see the sketch after this list)
- Argonne has website to drop data sample, get back report
- ??? (please paste URL)
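For reference, a minimal Python sketch of the Weissman score as defined on the Wikipedia page above, illustrating the unit dependence; the constant alpha and the reference-compressor numbers are made up for illustration:

```python
# Sketch of the Weissman score W = alpha * (r / r_ref) * (log(t_ref) / log(t));
# r = compression ratio, t = compression time, r_ref/t_ref = reference compressor.
import math

def weissman(r, t, r_ref, t_ref, alpha=1.0):
    return alpha * (r / r_ref) * (math.log(t_ref) / math.log(t))

# The same measurement expressed in seconds vs. in minutes:
print(weissman(r=4.0, t=120.0, r_ref=2.0, t_ref=60.0))  # times in seconds
print(weissman(r=4.0, t=2.0,   r_ref=2.0, t_ref=1.0))   # times in minutes -> log(1) = 0, score collapses to 0
```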
- how about encrypting HDF5 files with filters?
- encryption filters can be written; however, they would only encrypt the payload (the HDF5 datasets), not the metadata
- github repos out there to encrypt
- ??? (please paste URL)
Expanding HDF5 capabilities to support multi-threading access and new types of storage - Elena Pourmal
- are sparse arrays being considered for support in HDF5?
- yes, this is being considered
- in the future: can the file size be reduced without rewriting the complete file (i.e. with sparse arrays)?
- not sure this is possible; e.g. removing datasets creates holes in the file storage
- best would be to rewrite the file (see the sketch below)
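A minimal h5py sketch of why deleting a dataset does not shrink the file and why a rewrite (e.g. with h5repack) is needed to reclaim the space; file and dataset names are placeholders:

```python
# Sketch: deleting a dataset only removes the link, the space is not reclaimed.
import os
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("big", data=np.zeros((1000, 1000), dtype="f8"))
size_before = os.path.getsize("demo.h5")

with h5py.File("demo.h5", "a") as f:
    del f["big"]                      # unlinks the dataset, leaves a hole in the file
size_after = os.path.getsize("demo.h5")

print(size_before, size_after)        # roughly the same size
# Reclaiming the space requires rewriting, e.g.:  h5repack demo.h5 demo_repacked.h5
```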
- comment: having multi-threaded HDF5 or netCDF library would be super great
- helpful for many applications (also considering parallel i/o)
- Lifeboat plans on having a release for the community to test in Spring 2024
- question on page 12, “Results HDF5 VOL Connector Prototype”: how should the metadata reading results be understood?
- an optimisation was done to bring all file metadata into memory
- this refers to HDF5 metadata (not experimental metadata)
Recent improvements in the HDF5/Blosc2 plugin systems - Francesc Alted
blosc2 plugin mentioned: https://github.com/Blosc/blosc2_openhtj2k
- depends on https://github.com/osamu620/OpenHTJ2K
does blosc2 guarantee that the filters are future-proof?
- the Blosc team tries to be future-proof (a high priority by their standards)
- see the blog post from 2018-02-21 on the forward-compatibility policy, written after a forward-compatibility issue some years ago: https://www.blosc.org/posts/new-forward-compat-policy/
- if something is introduced that breaks forward/backward compatibility, the team responds quickly to revert such a change
how do you verify whether a new plugin (like openhtj2k) is thread-safe?
- multiple approaches to add multi-threading
- openhtj2k is not thread-safe, so it cannot be used multi-threaded right now
- for small block sizes, compression ratios suffer -> no benefit from multi-threading
- the original author is working on thread-safety; it is already exposed in the blosc2 plugin metadata
what is the policy to decide which codecs go into blosc2 versus which become a plugin?
- intention is to avoid code bloat of core blosc2
- advertise plugins more, as they are decoupled from blosc2 core development
- this way, community has more ownership of development
- hope: more people can use plugins
re btune: how does it choose the best codec?
- btune exposes a parameter that balances speed versus compression ratio, which helps control the trade-off
how does btune account for the effect of different hardware on compression speed?
- tricky topic, at this moment we are not dealing with this effect
- compression behaviour can differ (local laptop versus btune server)
- use a testing platform similar to the one that will be used in practice (see the sketch below for the underlying speed-vs-ratio trade-off)
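For context, a stdlib-only sketch of the speed-versus-ratio trade-off that the btune parameter controls; it simply sweeps zlib levels by hand and is not how btune itself works:

```python
# Sketch: measure compression ratio vs. time for different zlib levels.
import time
import zlib
import numpy as np

data = np.random.default_rng(0).integers(0, 16, size=2_000_000, dtype=np.uint8).tobytes()

for level in (1, 5, 9):
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    ratio = len(data) / len(compressed)
    print(f"level={level}  ratio={ratio:.2f}  time={dt * 1e3:.1f} ms")
```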
another question was about the filters: does Blosc guarantee that the standard filters will always keep working in the future, i.e. can Blosc always read a file in the future if one of the standard filters was used to compress the original file?
- totally. one of the reasons for including the sources for the standard codecs/filters is to have control over forward compatibility. we are extremely committed to that goal
- a different case would be a dynamic filter maintained by an external user/group; we cannot make any guarantees there
GPU processing of HDF5 data - Jerome Kieffer
nvcomp: https://developer.nvidia.com/nvcomp
opencl lz4: https://www.silx.org/doc/silx/dev/Tutorials/codec/Bitshuffle-LZ4.html
hdf5plugin currently switches off OpenMP: when using numpy in parallel, I would like to use OpenMP nonetheless.
- numpy will not use OpenMP by default
- linalg libs however would use OpenMP by default
hint: a great tool for profiling Python is https://github.com/P403n1x87/austin/ (sampling-based, interfaces with flamegraph and more)
Processing HDF5 data with FPGAs - Zdenek Matej
- FPGAs are traditionally not used in photon-science data processing
- Zdenek got interested when he saw the potential after a few days hacking
- issues with installation and programming have become much easier and are not a blocking point anymore
- implemented azimuthal integration on FPGA
- BSLZ4 compression on FPGA is still WIP
Sep 19, afternoon session
NetCDF Compression Improvements - Edward Hartnett
- quantization, i.e. limiting the precision by setting trailing significand bits to zero, works well together with zlib (see the sketch after this list)
- netCDF has a community project for data compression plugins, the CCR (Community Codec Repository?)
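A numpy sketch of this kind of bit-rounding quantization (zeroing the lowest float32 mantissa bits before lossless compression); the number of dropped bits and the zlib comparison are illustrative and this is not the exact netCDF algorithm:

```python
# Sketch: zero the lowest mantissa bits of float32 data, then compare zlib sizes.
import zlib
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000_000).astype(np.float32)

def drop_mantissa_bits(a, bits_to_drop):
    """Zero the lowest `bits_to_drop` bits of the float32 mantissa (lossy)."""
    ints = a.view(np.uint32)
    mask = np.uint32(0xFFFFFFFF) << np.uint32(bits_to_drop)
    return (ints & mask).view(np.float32)

quantized = drop_mantissa_bits(data, bits_to_drop=16)   # keep ~7 of 23 mantissa bits

print("original  :", len(zlib.compress(data.tobytes(), 6)))
print("quantized :", len(zlib.compress(quantized.tobytes(), 6)))
print("max abs error:", np.max(np.abs(data - quantized)))
```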
hdf5plugin: Use HDF5 compression filters from Python - Thomas Vincent
- generic solution for installing compression filters
- discussion topic for tomorrow: where should hdf5plugin live, i.e. in h5py or in libhdf5? (a usage sketch follows below)
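Typical hdf5plugin usage with h5py, as a minimal sketch (dataset name, chunking, and the choice of the Bitshuffle filter are placeholders):

```python
# Sketch: write a compressed dataset using a filter provided by hdf5plugin.
import numpy as np
import h5py
import hdf5plugin   # importing registers the bundled HDF5 compression filters

data = np.random.default_rng(0).integers(0, 1000, size=(256, 256), dtype=np.uint16)

with h5py.File("compressed.h5", "w") as f:
    # The filter object expands to the compression/compression_opts keyword arguments.
    f.create_dataset("image", data=data, chunks=(64, 64), **hdf5plugin.Bitshuffle())

with h5py.File("compressed.h5", "r") as f:
    restored = f["image"][...]    # reading works as long as hdf5plugin is imported
```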
Current and upcoming challenges for data packaging of DECTRIS X-ray detectors - Max Burian et al
- implemented a FileWriter (HDF5 writer) to enable parallel file writing
- limited use of virtual datasets, because full use poses problems for processing (see the generic VDS sketch after this list)
- support the NXmx Nexus standard
- virtual datasets do not automatically solve parallel writing issues
- raw data rates will increase >5 up to 200 Gbps of compressed data
- available bandwidth will always be used
- compression algorithms need to be improved continuously for each architecture
- need to save only the data which are scientifically useful
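A generic h5py virtual dataset (VDS) sketch for reference; it only illustrates how frames spread over several files can be exposed as one dataset, and is not the DECTRIS FileWriter. File names and shapes are placeholders.

```python
# Sketch: stitch per-writer files into a single virtual dataset.
import numpy as np
import h5py

n_files, frames_per_file, shape = 4, 100, (512, 1028)

# Assume each writer produced part_0.h5 .. part_3.h5 with a dataset "data".
for i in range(n_files):
    with h5py.File(f"part_{i}.h5", "w") as f:
        f.create_dataset("data", data=np.zeros((frames_per_file, *shape), dtype="u2"))

layout = h5py.VirtualLayout(shape=(n_files * frames_per_file, *shape), dtype="u2")
for i in range(n_files):
    source = h5py.VirtualSource(f"part_{i}.h5", "data", shape=(frames_per_file, *shape))
    layout[i * frames_per_file:(i + 1) * frames_per_file] = source

with h5py.File("master.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```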
Compression Plugins in h5wasm - Brian Maranville
- good explanation on what h5wasm is and how it is generated with emscripten
- compression filters zstd + lz4 work once the right symbols are exported
- possible now to build more plugins, need help from experts in cmake, emscripten, …
Sep 20, morning session
h5cpp and pninexus c++ libraries - Jan Kotanski
- a generic C++ API for HDF5 and Nexus
- pninexus provides an xml builder for generating files following the Nexus structure
- I/O of NeXus files: how do you handle writing strings? (in HDF5 there are multiple ways to write strings; see the sketch after this block) How do you handle cases where people try to read NeXus files which were written outside of your ecosystem?
- yes, we try to support them
- additional traits can be provided if string formats mismatch
- sometimes individual handlers need to be provided
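To illustrate the point about multiple string representations in HDF5, a minimal h5py sketch of variable-length versus fixed-length string datasets (this shows the generic HDF5 concept, not the h5cpp/pninexus API):

```python
# Sketch: two common ways of storing strings in HDF5.
import h5py

with h5py.File("strings.h5", "w") as f:
    # Variable-length UTF-8 strings
    vlen = h5py.string_dtype(encoding="utf-8")
    f.create_dataset("vlen_strings", data=["alpha", "beta", "gamma"], dtype=vlen)

    # Fixed-length ASCII strings (padded to 16 bytes)
    fixed = h5py.string_dtype(encoding="ascii", length=16)
    f.create_dataset("fixed_strings", data=[b"alpha", b"beta", b"gamma"], dtype=fixed)
```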
- what C++ library do you use for XML handling (read/write)?
- use boost xml parser for now
- can STL containers also be used for writing/reading data?
- yes, adapters are provided and are handled by templates inside h5cpp/pninexus
- how does this compare to other c++ bindings of hdf5 libs?
- not sure and never compared
- Elena: all c++ libs/wrappers are different, coverage of hdf5 API is not comprehensive
- Jan: limited to parts of API that are useful for us
openPMD - the open Standard for Particle Mesh Data - Franz Pöschel
a metadata standard and ecosystem which supports HDF5, ADIOS, and JSON as backing file formats
reference implementation is openPMD-api as an open stack for scientific I/O
HELPMI project will enable interoperability between Nexus and openPMD
I/O performance is no longer scaling on HPC systems, similar to the situation with camera data rates
are most files compressed, or all of them?
- compression only available for ADIOS2 files
- chunking was recently implemented for HDF5 (with compression following from that)
are mass and charge set as attributes? (page 6/33)
- charge and mass are arrays, but can be replaced with attributes if the value is constant across all particles of the same species (ditto for a mesh, e.g. constant temperature)
- good idea: use attributes for constant properties, datasets for non-constant ones (PS: not sure I captured this right; a generic sketch follows below)
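A generic h5py sketch of that pattern (constant per-species values as attributes, per-particle values as datasets); this illustrates the idea only and is not the actual openPMD layout; all names and values are placeholders:

```python
# Sketch: constants as attributes, per-particle quantities as datasets.
import numpy as np
import h5py

n_particles = 10_000
positions = np.random.default_rng(0).normal(size=(n_particles, 3))

with h5py.File("particles.h5", "w") as f:
    grp = f.create_group("electrons")
    grp.attrs["charge"] = -1.602176634e-19   # constant across the species -> attribute
    grp.attrs["mass"] = 9.1093837015e-31     # constant across the species -> attribute
    grp.create_dataset("position", data=positions)   # varies per particle -> dataset
```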
Is OpenPMD specific to geometries, i.e. can you add custom properties of particles?
- standard allows extensions, predefines a small set of common geometries
NexusCreator & iCAT - Helmholtz-Zentrum Berlin applying FAIR data management - Hector Perez Ponce
BESSY beamlines store data in different ways, but work is in progress to standardise on NeXus and register the data in ICAT
NeXus is an ontology, cf. https://nexusformat.org; nxstruct is a structure for defining the layout
using CoffeeScript (Python-like syntax that compiles to JavaScript) and h5wasm to produce NeXus/HDF5 files
supports NOMAD to NeXus/HDF5 conversion, Kafka to Nexus/HDF5, Bluesky to Nexus/HDF5
example on U41-PEAUX RIXS beamline at BESSY
ongoing work is
is there a web application for NexusCreator?
is NexusCreator effectively written in JavaScript?
- yes, due to my background
- is there a web application?
- not at the moment, this is likely planned
for reproducibility: can you go back from nexus to the original?
- idea: in the future processing software will ingest nexus directly
- for now, the raw data is stored in iCAT to be able to go back to it
at DESY: discussion to extract metadata from iCat; how do you do that at HZB?
- iCat: ???
converting a generic NeXus file to an application definition: what if not all data is available?
- we don’t have everything in fact
- we are currently FAIR-ready here, so for some parts of the data we are FAIR, for some we are not
- if required data is missing, it is not added post hoc
speaker: if people would like access to the NexusCreator repo, please send Hector Perez Ponce a message
do you support encryption?
- currently implementing compression
- encryption not available
Is there some validation when data is stored?
- people provide raw data, then we chat, explain the NeXus standard, and write the file “manually” together
- plan to automate this? (possibly similar beamlines could be combined)
- yes, considering it
- in the future would be great
comment: in the NOMAD environment, the NeXus file is validated and it is checked whether the file content matches
Using Sparse Arrays for Synchrotron 3D-XRD-CT Data Reduction - Jonathan Wright
tool for sample based profiling: https://github.com/benfred/py-spy
how to organize data processing, i.e. the pipelining?
- on slurm cluster: jupyter notebooks (users can connect through a browser with this)
- jupyter notebooks process dataset from memory (requires large enough node)
- users can only touch data after experimental run is finished
- compression slows things down; one could say: just don’t do compression then
- crucial: reliable code when experiment is running -> processing can run later
Q: pinning processing to numa nodes came up multiple times now. Just wondering, SLURM can be configured to be NUMA aware, i.e. it would place and pin processes in jobs using cgroups, wouldn’t that be a solution?
- taskset and pinning doesn’t solve all problems
take data from the EIGER directly and process it on the fly?
- would be possible
- disk storage is the limiting factor
- could be possible as a LIMA plugin -> a question of human resources
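Since the talk is about sparse arrays for data reduction, here is a minimal numpy/h5py sketch of storing only the above-threshold pixels of a detector frame; threshold, names, and layout are illustrative and not the speaker's actual pipeline:

```python
# Sketch: keep only above-threshold pixels of a frame (COO-style sparse storage).
import numpy as np
import h5py

rng = np.random.default_rng(0)
frame = rng.poisson(0.05, size=(2048, 2048)).astype(np.uint32)   # mostly zeros

threshold = 0
rows, cols = np.nonzero(frame > threshold)
values = frame[rows, cols]

with h5py.File("sparse_frame.h5", "w") as f:
    f.attrs["frame_shape"] = frame.shape
    f.create_dataset("row", data=rows.astype(np.uint16), compression="gzip")
    f.create_dataset("col", data=cols.astype(np.uint16), compression="gzip")
    f.create_dataset("value", data=values, compression="gzip")

# Reconstruction:
with h5py.File("sparse_frame.h5", "r") as f:
    restored = np.zeros(tuple(f.attrs["frame_shape"]), dtype=np.uint32)
    restored[f["row"][...], f["col"][...]] = f["value"][...]

assert np.array_equal(frame, restored)
```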
Data Reduction in serial crystallography - Marina Galchenkova
lossless compression does not provide enough compression for SC data
using lossy compression means it is crucial to check the loss in quality of results, resolution is key
reprocessing raw data allows new pipelines to reach higher resolution
saving only the peaks will result in worse resolution
the easiest solution is to save a few significant bits!
need to store more low-intensity peaks; save an 8-bit floating-point-like representation
binning + 3 significant bits gives a compression ratio of up to 36.3
work has been presented at a few conferences + is implemented at P11 (DESY) and applied to ESRF+APS data
conclusion/question: given the concern about data reduction, can you choose the way the data is reduced during the experiment?
- yes and no
- yes: based on existing pipeline, check data quality pipelines
- no: methods like binning
on the binning slide, quality scores get better with compressed data; maybe the compression removed noise (when using the EIGER 16M)
- whether it is the right or wrong detector is tricky to answer
- a lot of details depend on external factors (beam quality) and experiment design (geometry)
- aim: process data and resolve peaks at highest resolution possible, remain flexible for different types of samples
what does binning mean?
- take 2x2 pixel blocks and sum them up (see the sketch below)
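A minimal numpy sketch of 2x2 binning (the frame contents are placeholders):

```python
# Sketch: 2x2 binning by summing each 2x2 pixel block.
import numpy as np

frame = np.arange(16, dtype=np.uint32).reshape(4, 4)

h, w = frame.shape
binned = frame.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

print(binned.shape)   # (2, 2); each output pixel is the sum of a 2x2 block
```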
how to accumulate the frames?
- [not sure I understood] frame by frame accumulation is used, not a 3D volume as in https://doi.org/10.1107/S2052252514022313
Lossless and Lossy Compression for Photon Science - Peter Steinbach
presented summary results from across the photon sources
3 techniques need a much higher compression ratio, viz. SX, XPCS, and laser shots
signal processing before compression can become key, e.g. binning, quantisation
btune helped improve the parameters for lossless compression, reaching a compression ratio of 3–4
LEAPS-INNOV tried to find the best parameters for compression, i.e. how to interpret MSSIM (see the sketch after this block)
neural compressors impose new challenges because the model, the sampling distribution, and the encoding all need to be stored
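For the MSSIM point, a minimal scikit-image sketch of computing the structural similarity between an original frame and a lossy reconstruction; the synthetic data and the use of skimage's structural_similarity are assumptions about the metric meant here (a mean SSIM would average such values over frames):

```python
# Sketch: SSIM between an original frame and its lossy reconstruction.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.normal(size=(512, 512)).astype(np.float32)

# Stand-in for lossy compression/decompression: quantize to 256 levels.
levels = 256
lo, hi = original.min(), original.max()
reconstructed = np.round((original - lo) / (hi - lo) * (levels - 1)) / (levels - 1) * (hi - lo) + lo

score = structural_similarity(original, reconstructed, data_range=float(hi - lo))
print(f"SSIM = {score:.4f}")   # 1.0 means identical images
```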
did you notice a change in scientists’ acceptance of lossy compression over the last few years?
- not really, but increasing costs for storage and energy will eventually impose limits which will force scientists to accept lossy compression
Sep 20, afternoon session
HDF Compression for data service architectures - John Readey
- the cloud has become a platform that solves some of the issues of the HDF5 library
- bandwidth performance comes from having many requests simultaneously
- can scale by adding more nodes
- in contrast to the regular HDF5 library, HSDS can compress variable-length data (a minimal h5pyd access sketch follows below)
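For reference, a minimal sketch of accessing HSDS from Python via h5pyd, which mirrors the h5py API; the domain path, endpoint, and dataset name are placeholders and assume a running HSDS instance with credentials configured:

```python
# Sketch: read a dataset from an HSDS server through h5pyd (h5py-like API).
import h5pyd

# Domain path and endpoint are placeholders for an actual HSDS deployment.
with h5pyd.File("/shared/example/data.h5", "r",
                endpoint="http://hsds.example.org:5101") as f:
    dset = f["image"]
    block = dset[0:10, 0:10]   # the server handles chunk fetching/decompression
    print(dset.shape, block.mean())
```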
Discussion
Where to put common filters?
List the use cases
- Long term usability (10 years)
What tooling is required?
- How does a data consumer get hold of missing add-ins?
Improve the error messages when a filter is not available
Which kind of HDF5 Metadata can be exposed to help in the overall Data FAIRisation?
What governance, where, and who?
- Science
- Policy
- Certification
- Change management
- Technical
Availability of plugins in 10 yrs
- Distribution of artifacts (e.g., Francesc’s Python wheels, .NET NuGet)
Mechanism for a few key filters to have similar support to GZip? (Promotion of some filters)
Participants
- Data producers
- Data consumers
- Filter developers
- Plugin developers
- Integrators (e.g., h5py), 3rd party developers (e.g., MathWorks)
What is a filter vs. a plugin?
- Provide skeleton(s)
Survey of filter popularity?
Testing
- Regression
- Performance
- “Exotic hardware”
Committee suggestions
- Dana R
- Elena P
- Peter Steinbach
- Thomas Vincent
- Commercial? DECTRIS?
- MATLAB
- Francesc Alted
Funding?
- EU open calls next year - EOSC
- Zuckerberg / Amazon? This fall
- NSF
Miscellaneous complaints
- Iterators
- Plugins don’t work directly
- Delta filter structure?
- Filter memory overhaul