Different runs producing the same structure/data HDF5 file lead to different md5sum hash digests. How to solve it?

I have a MATLAB script which produces an HDF5 dataset. The md5sum for this file changes across different runs, even though the HDF5 file's content remains identical. How can I prevent that?
I suspect it has to do with a creation timestamp that might be stored somewhere in the HDF5 file, but I couldn't find it. This is how I am creating the file and storing values in it:
filename = "cpssm_dataset.h5";
file_path = fullfile(fileparts(mfilename('fullpath')), 'raw', 'cpssm', filename);
file_id = H5F.create(...
    file_path, ...       % filename
    'H5F_ACC_TRUNC', ... % overwrite any existing file
    'H5P_DEFAULT', ...   % default file-creation properties
    'H5P_DEFAULT' ...    % default file-access properties
);
% Create /Data group; close the returned identifier so it is not leaked
group_id = H5G.create(file_id, '/Data', 'H5P_DEFAULT', 'H5P_DEFAULT', 'H5P_DEFAULT');
H5G.close(group_id);
% ...
% ...
% ...
group_path = sprintf('/Data/%s', city_name);
group_id = H5G.create(file_id, group_path, 'H5P_DEFAULT', 'H5P_DEFAULT', 'H5P_DEFAULT');
H5G.close(group_id);
h5writeatt(file_path, group_path, 'Name', city_name);
% ...
% ...
% ...
group_path = sprintf('/Data/%s/%s/drift_vel%d/sat%d/%s/%s', city_name, severity, eastward_drift_vel, sat_idx, constellation, freq);
% add data
amplitude = scenario.(freq).amplitude.timeseries_postprop.Var1;
h5create(file_path, group_path + "/amplitude", [1 numel(amplitude)]);
h5write(file_path, group_path + "/amplitude", amplitude.');
% ...
% ...
% ...
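If the varying bytes come from object-header timestamps, the underlying C library lets you disable time tracking per object via H5Pset_obj_track_times, set on a creation property list. A minimal sketch, assuming your MATLAB release exposes the corresponding low-level wrapper H5P.set_obj_track_times:

```matlab
% Sketch: create a group whose object header stores no modification times.
% Assumes H5P.set_obj_track_times is available in your MATLAB release
% (it wraps the C routine H5Pset_obj_track_times).
gcpl = H5P.create('H5P_GROUP_CREATE');
H5P.set_obj_track_times(gcpl, false);  % no timestamps in the object header
gid  = H5G.create(file_id, '/Data', 'H5P_DEFAULT', gcpl, 'H5P_DEFAULT');
H5G.close(gid);
H5P.close(gcpl);
% The same property can be set on a dataset-creation property list before
% H5D.create. Other sources of nondeterminism (free-space state, chunk
% layout) may still remain, so verify with md5sum afterwards.
```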
These are the md5sum digests for two different runs:
(iono-scint-charact) tapyu@felix-Alienware-m16-R1:~/git/iono-scint-charact/data/raw/cpssm$ md5sum cpssm_dataset.h5
66c3ea9930c1adfeb49d4e15dcfdf018 cpssm_dataset.h5
(iono-scint-charact) tapyu@felix-Alienware-m16-R1:~/git/iono-scint-charact/data/raw/cpssm$ md5sum cpssm_dataset.h5
445738bb297dd50b1f8c69646a487645 cpssm_dataset.h5
They should be the same!
The values are in fact identical: h5dump produces the same STDOUT for both files.
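That observation suggests a layout-independent check, assuming the h5dump tool from the HDF5 distribution is installed: hash the textual dump rather than the raw bytes, so two files with identical logical content yield identical digests even when their byte layouts differ.

```shell
# Hash the logical content instead of the raw file bytes.
# Two runs with identical data/structure give the same digest here,
# even if `md5sum cpssm_dataset.h5` differs.
h5dump cpssm_dataset.h5 | md5sum
```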
  4 Comments
Umar on 8 Sep 2025

Hi @Rubem,

That’s a fair observation — and it points to the subtlety of how HDF5 files are laid out internally.

  • When you compute an MD5 of the raw file bytes, you are sensitive not only to dataset values and attributes, but also to low-level structural details: object header creation times, free-space manager state, alignment padding, or chunk indexing. These can differ across writes even if the logical content is the same. This is why the HDF Group themselves caution that HDF5 files are not bitwise stable.
  • In your Python workflow, it may appear “reliable” because you are likely writing the file with identical settings, in the same session, and without features that introduce non-determinism (e.g. timestamps or chunk free lists). In that constrained situation, two writes may indeed yield identical byte streams — so the MD5 happens to match. But this should be seen as an implementation artifact, not a guarantee of the format.
  • If your goal is to verify that two HDF5 files contain the same scientific data, the robust approach is to compare datasets and attributes programmatically (e.g. using `h5diff` on the command line, or `h5py` in Python).

So in short: yes, sometimes the MD5 will coincide; but it is not a reliable, portable guarantee of equality across environments or HDF5 versions.
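The byte-level versus logical distinction above can also be handled by hashing a canonical serialization of the content yourself. A minimal stdlib-only sketch, using a nested dict as a hypothetical stand-in for the HDF5 group/dataset hierarchy (with h5py you would walk the real file via `visititems` in the same sorted order):

```python
import hashlib
import struct

def content_hash(tree, prefix=""):
    """Hash (path, values) pairs in sorted order, ignoring storage layout."""
    h = hashlib.md5()
    for name in sorted(tree):
        path = prefix + "/" + name
        node = tree[name]
        if isinstance(node, dict):   # "group": recurse into children
            h.update(content_hash(node, path).encode())
        else:                        # "dataset": hash path plus float values
            h.update(path.encode())
            for v in node:
                h.update(struct.pack("<d", v))
    return h.hexdigest()

# Two hierarchies with the same logical content hash identically,
# regardless of how (or when) they were built.
a = {"Data": {"city": {"amplitude": [1.0, 2.5, 3.0]}}}
b = {"Data": {"city": {"amplitude": [1.0, 2.5, 3.0]}}}
print(content_hash(a) == content_hash(b))   # True
```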


Answers (1)

Walter Roberson on 6 Sep 2025
HDF5 objects have unique object identifiers, but there is no requirement that two HDF5 files written the same way use the same object identifiers.
  2 Comments
Rubem on 8 Sep 2025
> there is no requirement that two HDF5 files written the same way use the same object identifiers.
How can I enforce that? For instance, in Python I am managing to generate the same md5sum digest for HDF5 files containing the same data/structure.


Release

R2024b
