The European XFEL is an X-ray laser research facility that produces extremely short and intense X-ray flashes, enabling investigations across a wide range of fields—from the structure of matter to the dynamic evolution of molecular systems. A typical experiment can generate petabytes of data within a day, originating from diverse detectors and in multiple formats. Managing this...
ITER (International Thermonuclear Experimental Reactor) is the largest international experiment aimed at generating nuclear fusion energy through magnetic confinement. ITER's objective is to operate in modes that come as close as possible to the conditions of a commercial fusion reactor, which implies long pulses and continuously running systems.
From the point of view of the data...
HDF5 is the de facto standard for storing large volumes of binary data in files. Blosc2, an award-winning high-performance library, excels at compressing binary data in memory. Both are widely used, making their integration natural. This talk will cover using Blosc2 as an HDF5 filter and HDF5 as a Blosc2 backend.
We will outline the current state of the Blosc2 plugin for HDF5...
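For a flavor of the filter route, here is a minimal sketch using h5py together with the hdf5plugin package, which ships a Blosc2 HDF5 filter; the codec, level, and chunk shape are illustrative choices, not recommendations:

```python
import h5py
import numpy as np
import hdf5plugin  # importing registers the Blosc2 filter with HDF5

data = np.random.rand(1000, 1000)
with h5py.File("compressed.h5", "w") as f:
    # Chunked dataset compressed through the Blosc2 HDF5 filter.
    f.create_dataset(
        "data",
        data=data,
        chunks=(100, 1000),
        **hdf5plugin.Blosc2(cname="zstd", clevel=5,
                            filters=hdf5plugin.Blosc2.SHUFFLE),
    )
```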
The integration of the HDF5 high-level I/O library, encompassing Single Writer Multiple Reader (SWMR) and parallel mode (pHDF5), with Pure Storage FlashBlade over NFS, modern Linux kernels, and modern networks offers a robust, high-performance solution. This combination significantly reduces Total Cost of Ownership (TCO) and technical debt compared to traditional parallel file systems.
HDF5,...
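For context, the canonical h5py pattern for pHDF5 collective file access looks like the sketch below; the same code runs whether the file lives on a traditional parallel file system or on an NFS mount such as FlashBlade:

```python
from mpi4py import MPI
import h5py  # must be built with parallel HDF5 support

comm = MPI.COMM_WORLD
# All ranks open the same file through the MPI-IO driver.
with h5py.File("shared.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ranks", (comm.size,), dtype="i8")
    dset[comm.rank] = comm.rank  # each rank writes its own slot
```

Launched as, e.g., `mpiexec -n 4 python script.py`, all four ranks write concurrently into one shared file.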
In this talk, we will present an overview and demonstration of two HDF5 features implemented via virtual file drivers (VFDs), both currently in a prototype stage: the full single-writer/multiple-reader capability (VFD SWMR) and HDF5 versioning (also known as the Onion VFD).
The VFD SWMR feature enables file modifications during writing and provides guarantees on the maximum latency before new...
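For comparison, the existing (non-VFD) SWMR API in h5py looks like the sketch below; VFD SWMR is still a prototype, so its eventual interface may differ:

```python
import h5py

# Writer side: create an appendable dataset, then switch to SWMR mode
# so concurrent readers can safely poll the file while we append.
f = h5py.File("stream.h5", "w", libver="latest")
dset = f.create_dataset("samples", (0,), maxshape=(None,), dtype="f8")
f.swmr_mode = True
for i in range(10):
    dset.resize((i + 1,))
    dset[i] = 0.5 * i
    dset.flush()  # publish the new element to readers
f.close()

# Reader side (separate process):
#   h5py.File("stream.h5", "r", libver="latest", swmr=True)
```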
When people hear this statement, a common reaction is, "When, HDF Group, when?" In her book How to Make Sense of Any Mess, Abby Covert reminds us not to fall in love with our plans or ideas, but with the effects we can have when we communicate clearly. While I would reject the notion that "a mess" aptly describes the current state of HDF5, I want to use this short...
The [NDE File Format][1] (.nde), developed by Evident, is an open, extensible data format tailored for the non-destructive evaluation (NDE) and testing (NDT) industry. Built upon the HDF5 container and augmented with JSON-based metadata, it offers a platform-independent solution for storing inspection data, primarily for the ultrasonic modality. By adopting an open format, .nde files can be...
Version 3 of the Zarr specification includes a sharding codec that allows chunks to contain small inner chunks. The binary layout of the resulting shards is reminiscent of an HDF5 file. Both HDF5 files and Zarr v3 shards may contain compressed chunks. Furthermore, the Zarr v3 shard specification is similar to the Fixed Array Data Block structure within an HDF5 file....
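For reference, the sharding codec is declared in the Zarr v3 array metadata roughly as follows; this sketch follows the names in the Zarr v3 sharding specification, while the shapes and inner codecs are illustrative:

```python
# Fragment of Zarr v3 array metadata selecting the sharding codec.
sharding_codec = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": [32, 32],   # inner chunks within each shard
        "codecs": [                # pipeline applied to each inner chunk
            {"name": "bytes"},
            {"name": "gzip", "configuration": {"level": 5}},
        ],
        "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}],
        "index_location": "end",   # shard index stored at the end of the shard
    },
}
```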
GPUs and similar accelerators have become the dominant compute platform for STEM applications, from finance to space flight and beyond. However, HDF5 continues to execute exclusively on the host CPU of GPU nodes. This talk will present a design overview of moving the I/O pipeline (filters, datatype conversions, and other transforms) from the CPU to the GPU, including how to perform I/O...
This talk will provide an overview of the current state of the HDF5 software ecosystem, highlighting recent advancements, key components, and ongoing challenges. We will explore the future directions of the various tools and libraries that empower researchers and developers to manage HDF5-related workflows efficiently. Additionally, the talk will outline potential future initiatives for the...
MAX IV is an accelerator-based light source located in southern Sweden. It continuously operates 16 experimental stations using X-rays for material and life sciences. User data analysis primarily occurs on a small edge HPC cluster, while automatic scientific data processing, from around 100 data sources, runs in an edge-cloud environment. These pipelines provide rapid feedback on data acquisition and...
HDF5 is an enormously powerful and flexible file format. There are many different ways to use it, and it's difficult to provide one API that works efficiently for all the possible use cases. However, the complexity of the on-disk file format is a high barrier to alternative implementations, so with a few heroic exceptions, most code reading & writing HDF5 does so through the canonical C...
h5pydantic is a Pydantic-based library aimed at making it easier for scientists to organise their HDF5 files by writing Python models of their experiments. The library is similar to an Object Relational Mapper (ORM), but instead of targeting a relational database, it targets HDF5.
The library is inspired by the needs of the Australian Synchrotron during the commissioning of our new beamlines...
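To illustrate the ORM analogy, a model in this style might look like the following; this is a hypothetical sketch of the idea using plain Pydantic, not the actual h5pydantic API:

```python
from pydantic import BaseModel

# Hypothetical sketch: a typed model describing an experiment's HDF5 layout.
class Detector(BaseModel):
    exposure_s: float    # could map to an HDF5 attribute
    frame_count: int     # could map to an attribute or a dataset's shape

class Experiment(BaseModel):
    sample_name: str
    detector: Detector   # nested model, mapping to a nested HDF5 group

# A library in this style would then provide load/dump helpers translating
# between such models and the groups, datasets, and attributes in a file.
```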
HSDS (Highly Scalable Data Service) is a REST-based service that provides read/write access to HDF5 data stores, backed by object storage or a POSIX file system. By combining multi-processing and asynchronous I/O, HSDS can achieve remarkable performance when accessing very large datasets. On the other hand, performance lagged for clients invoking a series of smaller requests (reading or writing a...
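From the client side, HSDS is typically accessed through h5pyd, which mirrors the h5py API; a minimal read sketch follows, with the domain path and endpoint being illustrative:

```python
import h5pyd  # h5py-compatible client for HSDS

# Each selection maps to an HTTP request to the service, so batching many
# small reads into one larger selection reduces round trips.
with h5pyd.File("/home/user/data.h5", "r",
                endpoint="http://hsds.example.org") as f:
    dset = f["measurements"]
    block = dset[0:4096]  # one request instead of thousands of tiny ones
```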
This contribution presents our experience using a pure Java implementation of the HDF5 file format to support metadata collection at the P05 beamline at Hereon, DESY. Since 2016, we have relied on Java-based solutions to generate HDF5 files for hundreds of experiments with minimal maintenance overhead.
We will provide a practical overview of how to use the Java HDF5 library effectively in...
Over the past few years, Lifeboat LLC has been focused on advancing the capabilities of HDF5 by incorporating multi-threaded support, enhancing the storage of sparse and variable-length data, and implementing robust encryption mechanisms for data stored within HDF5 files. These improvements are aimed at optimizing performance, increasing flexibility, and strengthening data security.
In our...
Working with HDF5 often means navigating large datasets, verbose APIs, and boilerplate-heavy code. In this demo-heavy session, we’ll explore how AI-assisted coding tools—specifically GitHub Copilot—can accelerate common HDF5 workflows across C, C++, and Python. From auto-generating read/write boilerplate, to documenting complex structures, to scaffolding tests and data conversion routines,...
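As a taste of the kind of boilerplate such tools can scaffold in seconds, here is a representative h5py read/write pair; all names here are illustrative:

```python
import h5py
import numpy as np

def save_run(path, temperatures, metadata):
    """Write a 1-D temperature series plus free-form metadata attributes."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("temperature",
                                data=np.asarray(temperatures, dtype="f8"),
                                compression="gzip")
        for key, value in metadata.items():
            dset.attrs[key] = value

def load_run(path):
    """Return the series and its attributes as (ndarray, dict)."""
    with h5py.File(path, "r") as f:
        dset = f["temperature"]
        return dset[...], dict(dset.attrs)
```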
As object storage becomes even more prevalent, HDF5's underlying storage format needs to be updated to match the interface that cloud and on-prem object systems provide. This talk will present a design overview of a mapping of the HDF5 data model onto S3-compatible storage systems. An outline of the planned VOL connector implementation and projected performance goals will be part of the talk.
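One way to picture such a mapping is a deterministic object-key scheme per dataset chunk; the sketch below is purely hypothetical and not the connector's actual layout:

```python
# Hypothetical key scheme: one S3 object per dataset chunk.
def chunk_key(file_prefix: str, dataset_path: str, chunk_index: tuple) -> str:
    """E.g. ('run42', '/detector/frames', (3, 0)) -> 'run42/detector/frames/3.0'"""
    coords = ".".join(str(i) for i in chunk_index)
    return f"{file_prefix}{dataset_path}/{coords}"

# Metadata (group structure, attributes) could live in a small number of
# separate objects, so that chunk reads become independent GET requests.
```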
A very common need when presented with an HDF5 file (especially for non-programmers or those new to HDF5), is some way to “see” the contents. Happily, there are a variety of viewers for HDF5 data sources available: HDFView, H5Web (myhdf5), HDF Compass, etc. However, it’s not obvious how these compare or when one or another might be preferable for a particular application. In this session,...
Modern science and engineering create and accumulate huge amounts of data, which are persisted through tools like HDF5 so that they remain available for further analysis, display, and many other operations. Increasing the efficiency of this data processing is critical for today's growing data volumes, not only to save time but also to make efficient use of the available resources.
This thesis aimed...
With the growing adoption of the Zarr data format for scalable and cloud-optimized storage, NetCDF has introduced an interface to support Zarr access. This integration enables a broader range of scientific software, beyond the Python ecosystem, to interact with Zarr datasets through the familiar NetCDF API. In this presentation, we will discuss the current state of the NetCDF-Zarr...
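For a flavor of the interface, NetCDF selects Zarr storage through a URL fragment; the sketch below assumes a libnetcdf build with NCZarr support, with the URL syntax following the NetCDF NCZarr documentation:

```python
import netCDF4  # requires a libnetcdf built with NCZarr support

# Write a variable into a Zarr store through the ordinary NetCDF API.
ds = netCDF4.Dataset("file:///tmp/out.zarr#mode=nczarr,file", "w")
ds.createDimension("time", 4)
temp = ds.createVariable("temperature", "f4", ("time",))
temp[:] = [280.0, 281.5, 283.0, 284.2]
ds.close()
```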
Scientific workflows are evolving rapidly, demanding the seamless integration of simulation, experiments, analytics, and AI. This evolution is placing immense pressure on traditional data management systems. To address these challenges, we present IOwarp, a new initiative focused on building a comprehensive data management platform. IOwarp aims to streamline complex scientific workflows by...
Rapid adoption of artificial intelligence (AI) in scientific computing requires new tools to evaluate I/O performance effectively. HDF5 is one of the data formats frequently used not only in HPC but also in modern AI applications. However, existing benchmarks are insufficient to address the current challenges posed by AI workloads. This talk introduces an extension to the existing HDF5...