Different runs producing the same structure/data HDF5 file lead to different md5sum hash digests. How to solve it?

I have a MATLAB script which produces an HDF5 dataset. The md5sum for this file changes across different runs, even though the HDF5 file's content remains identical. How can I prevent that?
I suspect it has to do with a creation timestamp that might be stored somewhere in the HDF5 file, but I couldn't find it. This is how I am creating the file and storing values in it:
filename = "cpssm_dataset.h5";
file_path = fullfile(fileparts(mfilename('fullpath')), 'raw', 'cpssm', filename);
file_id = H5F.create(...
    file_path, ...       % filename
    'H5F_ACC_TRUNC', ... % overwrite any existing file
    'H5P_DEFAULT', ...   % default file-creation properties
    'H5P_DEFAULT' ...    % default file-access properties
);
% Create /Data group; close the returned identifier so it is not leaked
group_id = H5G.create(file_id, '/Data', 'H5P_DEFAULT', 'H5P_DEFAULT', 'H5P_DEFAULT');
H5G.close(group_id);
% ...
% ...
% ...
group_path = sprintf('/Data/%s', city_name);
group_id = H5G.create(file_id, group_path, 'H5P_DEFAULT', 'H5P_DEFAULT', 'H5P_DEFAULT');
H5G.close(group_id);
h5writeatt(file_path, group_path, 'Name', city_name);
% ...
% ...
% ...
group_path = sprintf('/Data/%s/%s/drift_vel%d/sat%d/%s/%s', city_name, severity, eastward_drift_vel, sat_idx, constellation, freq);
% add data
amplitude = scenario.(freq).amplitude.timeseries_postprop.Var1;
h5create(file_path, group_path + "/amplitude", [1 numel(amplitude)]);
h5write(file_path, group_path + "/amplitude", amplitude.');
% ...
% ...
% ...
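If the varying bytes come from object-header timestamps, the underlying C library lets you disable time tracking per object via H5Pset_obj_track_times, set on a creation property list. A minimal sketch, assuming your MATLAB release exposes the corresponding low-level wrapper H5P.set_obj_track_times:

```matlab
% Sketch: create a group whose object header stores no modification times.
% Assumes H5P.set_obj_track_times is available in your MATLAB release
% (it wraps the C routine H5Pset_obj_track_times).
gcpl = H5P.create('H5P_GROUP_CREATE');
H5P.set_obj_track_times(gcpl, false);  % no timestamps in the object header
gid  = H5G.create(file_id, '/Data', 'H5P_DEFAULT', gcpl, 'H5P_DEFAULT');
H5G.close(gid);
H5P.close(gcpl);
% The same property can be set on a dataset-creation property list before
% H5D.create. Other sources of nondeterminism (free-space state, chunk
% layout) may still remain, so verify with md5sum afterwards.
```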
These are the md5sum digests for two different runs:
(iono-scint-charact) tapyu@felix-Alienware-m16-R1:~/git/iono-scint-charact/data/raw/cpssm$ md5sum cpssm_dataset.h5
66c3ea9930c1adfeb49d4e15dcfdf018 cpssm_dataset.h5
(iono-scint-charact) tapyu@felix-Alienware-m16-R1:~/git/iono-scint-charact/data/raw/cpssm$ md5sum cpssm_dataset.h5
445738bb297dd50b1f8c69646a487645 cpssm_dataset.h5
They should be the same!
The values are in fact identical: h5dump produces the same STDOUT for both files.
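That observation suggests a layout-independent check, assuming the h5dump tool from the HDF5 distribution is installed: hash the textual dump rather than the raw bytes, so two files with identical logical content yield identical digests even when their byte layouts differ.

```shell
# Hash the logical content instead of the raw file bytes.
# Two runs with identical data/structure give the same digest here,
# even if `md5sum cpssm_dataset.h5` differs.
h5dump cpssm_dataset.h5 | md5sum
```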
  4 Comments
Umar on 8 Sep 2025

Hi @Rubem,

That’s a fair observation — and it points to the subtlety of how HDF5 files are laid out internally.

  • When you compute an MD5 of the raw file bytes, you are sensitive not only to dataset values and attributes, but also to low-level structural details: object header creation times, free-space manager state, alignment padding, or chunk indexing. These can differ across writes even if the logical content is the same. This is why the HDF Group themselves caution that HDF5 files are not bitwise stable.
  • In your Python workflow, it may appear “reliable” because you are likely writing the file with identical settings, in the same session, and without features that introduce non-determinism (e.g. timestamps or chunk free lists). In that constrained situation, two writes may indeed yield identical byte streams — so the MD5 happens to match. But this should be seen as an implementation artifact, not a guarantee of the format.
  • If your goal is to verify that two HDF5 files contain the same scientific data, the robust approach is to compare datasets and attributes programmatically (e.g. using `h5diff` on the command line, or `h5py` in Python).

So in short: yes, sometimes the MD5 will coincide; but it is not a reliable, portable guarantee of equality across environments or HDF5 versions.
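The byte-level versus logical distinction above can also be handled by hashing a canonical serialization of the content yourself. A minimal stdlib-only sketch, using a nested dict as a hypothetical stand-in for the HDF5 group/dataset hierarchy (with h5py you would walk the real file via `visititems` in the same sorted order):

```python
import hashlib
import struct

def content_hash(tree, prefix=""):
    """Hash (path, values) pairs in sorted order, ignoring storage layout."""
    h = hashlib.md5()
    for name in sorted(tree):
        path = prefix + "/" + name
        node = tree[name]
        if isinstance(node, dict):   # "group": recurse into children
            h.update(content_hash(node, path).encode())
        else:                        # "dataset": hash path plus float values
            h.update(path.encode())
            for v in node:
                h.update(struct.pack("<d", v))
    return h.hexdigest()

# Two hierarchies with the same logical content hash identically,
# regardless of how (or when) they were built.
a = {"Data": {"city": {"amplitude": [1.0, 2.5, 3.0]}}}
b = {"Data": {"city": {"amplitude": [1.0, 2.5, 3.0]}}}
print(content_hash(a) == content_hash(b))   # True
```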


Answers (1)

Walter Roberson on 6 Sep 2025
HDF5 objects have unique object identifiers, but there is no requirement that two HDF5 files written the same way use the same object identifiers.
  2 Comments
Rubem on 8 Sep 2025
> there is no requirement that two HDF5 files written the same way use the same object identifiers.
How can I enforce that? For instance, in Python I am managing to generate the same md5sum digest for HDF5 files containing the same data/structure.


Release

R2024b
