
Work with Cloud-Optimized HDF5 Files

Since R2025a

In MATLAB®, you can optimize the performance of HDF5 files in the cloud. A cloud-optimized HDF5 file (CO HDF5 file) is an HDF5 file whose internal structures have been arranged for more efficient data access when the file is in cloud object stores. The goal of these arrangements is to reduce the number of server requests, which is often the limiting factor in the performance of HDF5 files in the cloud. The high-level (h5read and h5readatt) and low-level (H5D.read and H5A.read) reading functions in the MATLAB HDF5 interface, by default, attempt to take advantage of these optimized internal structures.
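
For example, after a file has been cloud optimized, you can read a dataset from it in a cloud object store the same way you would read it locally. In this sketch, the Amazon S3 bucket, path, and dataset names are placeholders for your own location.

data = h5read("s3://mybucket/path/myfile.h5","/DS1");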

This topic describes two strategies for creating CO HDF5 files using the MATLAB low-level HDF5 interface: avoiding variable-length datatypes and consolidating internal file metadata. These strategies can be used together or separately.

Note

Consolidating the metadata in an HDF5 file can improve the file's performance even when you access the file locally. The performance improvement is most noticeable for metadata-intensive operations, such as calls to the h5info function.
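
For example, a metadata-intensive call such as this one runs faster when all the internal file metadata is consolidated on a single page or block:

info = h5info("myfile.h5");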

Strategies for Creating Cloud-Optimized HDF5 Files

Avoid Variable-Length Datatypes

One strategy for creating a CO HDF5 file is to avoid using variable-length datatypes in the file. Although variable-length datatypes can offer modest space savings, they often fit irregularly on file pages. This irregularity can increase the number of server requests necessary to access the data in the file, adversely affecting performance.

If you do use variable-length datatypes in the file, consider using as few as possible.
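
For example, if the file stores text, consider a fixed-length string datatype instead of a variable-length one. This sketch assumes that 32 bytes is enough for each string; adjust the size for your data.

% Fixed-length string datatype, which fits regularly on file pages
fixedTypeID = H5T.copy("H5T_C_S1");
H5T.set_size(fixedTypeID,32)

% Variable-length string datatype, which can fit irregularly on file pages
% varTypeID = H5T.copy("H5T_C_S1");
% H5T.set_size(varTypeID,"H5T_VARIABLE")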

Consolidate Internal File Metadata

Another strategy for creating a CO HDF5 file is to consolidate the internal file metadata. Use these consolidation techniques, which the example later in this topic demonstrates:

  • Choose a paged aggregation strategy for handling file space in the file. To do so, specify the strategy argument of the H5P.set_file_space_strategy function as "H5F_FSPACE_STRATEGY_PAGE".

  • Choose a page size that allows all the internal metadata of the file to fit on a single page. A size of 8 MiB works well for many files. You can set the page size using the fsp_size argument of the H5P.set_file_space_page_size function.

  • Choose a chunk cache size between 1/8 and 1/4 of the page size. For example, if you set the page size to 8 MiB, then set the size of the chunk cache to be between 1 MiB and 2 MiB. You can set the size of the chunk cache using the nbytes argument of the H5P.set_chunk_cache function. When calling this function, consider specifying the nslots argument as a prime number close to 100. For example, you can specify nslots as 101.

  • If you have user-specific information to store in the file, then consider setting the size of the user block. Choose a user block size large enough to store all the user-specific information that the file contains. You can set the size of the user block using the size argument of the H5P.set_userblock function.

  • If the file will be written in more than one session, then consider enabling persistent free space in the file. To enable persistent free space, specify the is_persist argument of the H5P.set_file_space_strategy function as true.

If you cannot or choose not to use a paged aggregation strategy, then, instead of setting the page size and the chunk cache size, consider creating the file with a large metadata block. Choose a metadata block size that allows all the internal metadata of the file to fit in a single block. A size of 8 MiB works well for many files. You can set the size of the metadata block using the size argument of the H5P.set_meta_block_size function. For example:

size = 8*2^20; % 8 MiB
H5P.set_meta_block_size(faplID,size) % faplID is a file access property list identifier

Create Cloud-Optimized HDF5 File

This example shows how to create a cloud-optimized HDF5 file by using metadata consolidation techniques. You first configure file creation, file access, and dataset access property lists to set a paged aggregation strategy, page size, chunk cache size, and user block size, and then create the file and dataset using those property lists.

First, prepare for creating an HDF5 file by defining property lists, copying a datatype, creating a simple dataspace with the size of each dimension unlimited, and setting the chunk size.

fcplID = H5P.create("H5P_FILE_CREATE");
faplID = H5P.create("H5P_FILE_ACCESS");

H5P.set_libver_bounds(faplID,"H5F_LIBVER_LATEST","H5F_LIBVER_LATEST")
typeID = H5T.copy("H5T_NATIVE_DOUBLE");

unlimited = H5ML.get_constant_value("H5S_UNLIMITED");
dims = [512 1024];
h5_dims = fliplr(dims);
h5_maxdims = [unlimited unlimited];
spaceID = H5S.create_simple(2,h5_dims,h5_maxdims);
dcplID = H5P.create("H5P_DATASET_CREATE");
daplID = H5P.create("H5P_DATASET_ACCESS");

chunkDims = [512 1024];
h5_chunkDims = fliplr(chunkDims);
H5P.set_chunk(dcplID,h5_chunkDims)

Choose a paged aggregation strategy for handling file space, and enable persistent free space in the file.

strategy = "H5F_FSPACE_STRATEGY_PAGE";
is_persist = true;
threshold = 1; % Allow free-space manager to track all free space
H5P.set_file_space_strategy(fcplID,strategy,is_persist,threshold)

Set the page size in the file to 8 MiB.

fsp_size = 8*2^20; % 8 MiB
H5P.set_file_space_page_size(fcplID,fsp_size)

Set the size of the chunk cache for the dataset to 2 MiB, and use 101 slots in the cache.

nbytes = 2*2^20; % 2 MiB
nslots = 101; % Prime number close to 100
w0 = 0.75; % Library default value
H5P.set_chunk_cache(daplID,nslots,nbytes,w0)

Set the size of the user block in the file to 2 KiB.

size = 2*2^10; % 2 KiB
H5P.set_userblock(fcplID,size)

Now that the property lists are fully configured, create the file and then the dataset. Pass the dataset access property list to H5D.create so that the chunk cache settings take effect.

fileID = H5F.create("myfile.h5","H5F_ACC_TRUNC",fcplID,faplID);
dsID = H5D.create(fileID,"DS1",typeID,spaceID,"H5P_DEFAULT",dcplID,daplID);

Close all identifiers.

H5D.close(dsID)
H5P.close(daplID)
H5P.close(dcplID)
H5S.close(spaceID)
H5T.close(typeID)
H5F.close(fileID)
H5P.close(faplID)
H5P.close(fcplID)
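
As an optional check, read the new dataset back using the high-level interface. Because the example does not write any data, the result contains only fill values.

data = h5read("myfile.h5","/DS1");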
