Work with Cloud-Optimized HDF5 Files
In MATLAB®, you can optimize the performance of HDF5 files in the cloud. A cloud-optimized HDF5 file (CO HDF5 file) is an HDF5 file whose internal structures have been arranged for more efficient data access when the file is in cloud object stores. The goal of these arrangements is to reduce the number of server requests, which is often the limiting factor in the performance of HDF5 files in the cloud. By default, the high-level (h5read and h5readatt) and low-level (H5D.read and H5A.read) reading functions in the MATLAB HDF5 interface attempt to take advantage of these optimized internal structures.
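For example, after you upload a CO HDF5 file to a cloud object store, you can read it with the same functions you use for local files. This is a minimal sketch; the S3 bucket, file path, dataset name, and attribute name are hypothetical placeholders.
% Hypothetical remote location and names; replace with your own
data = h5read("s3://mybucket/path/to/myfile.h5","/DS1");
units = h5readatt("s3://mybucket/path/to/myfile.h5","/DS1","units");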
This topic describes two strategies for creating CO HDF5 files using the MATLAB low-level HDF5 interface: avoiding variable-length datatypes and consolidating internal file metadata. These strategies can be used together or separately.
Note
Consolidating the metadata in an HDF5 file can improve the file's performance even when you access the file locally. The performance improvement is most noticeable for metadata-intensive operations, such as calls to the h5info function.
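For instance, assuming the myfile.h5 file created in the example at the end of this topic, you can gauge the cost of a metadata-intensive call with timeit. This is a sketch; actual timings depend on the file and system.
% Time a metadata-intensive operation on a local file
f = @() h5info("myfile.h5");
t = timeit(f)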
Strategies for Creating Cloud-Optimized HDF5 Files
Avoid Variable-Length Datatypes
One strategy for creating a CO HDF5 file is to avoid using variable-length datatypes in the file. Although variable-length datatypes can offer modest space savings, they often fit irregularly on file pages. This irregularity can increase the number of server requests necessary to access the data in the file, adversely affecting performance.
If you do use variable-length datatypes in the file, consider using as few as possible.
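For example, for string data, consider a fixed-length string datatype instead of a variable-length one. This sketch assumes that 32 bytes is enough for the longest string you need to store; the identifier names are illustrative.
% Variable-length string datatype, which is best avoided in CO HDF5 files:
% vlenTypeID = H5T.copy("H5T_C_S1");
% H5T.set_size(vlenTypeID,"H5T_VARIABLE")

% Fixed-length alternative: pad every string to a known maximum length
fixedTypeID = H5T.copy("H5T_C_S1");
H5T.set_size(fixedTypeID,32)                   % 32-byte fixed-length strings
H5T.set_strpad(fixedTypeID,"H5T_STR_NULLTERM") % Null-terminate shorter strings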
Consolidate Internal File Metadata
Another strategy for creating a CO HDF5 file is to consolidate the internal file metadata. Use these consolidation techniques:
- Choose a paged aggregation strategy for handling file space in the file. To do so, specify the strategy argument of the H5P.set_file_space_strategy function as "H5F_FSPACE_STRATEGY_PAGE".
- Choose a page size that allows all the internal metadata of the file to fit on a single page. A size of 8 MiB works well for many files. You can set the page size using the fsp_size argument of the H5P.set_file_space_page_size function.
- Choose a chunk cache size between 1/8 and 1/4 of the page size. For example, if you set the page size to 8 MiB, then set the size of the chunk cache to be between 1 MiB and 2 MiB. You can set the size of the chunk cache using the nbytes argument of the H5P.set_chunk_cache function. When calling this function, consider specifying the nslots argument as a prime number close to 100. For example, you can specify nslots as 101.
- If you have user-specific information to store in the file, then consider setting the size of the user block. Choose a user block size large enough to store all the user-specific information that the file contains. You can set the size of the user block using the size argument of the H5P.set_userblock function.
- If the file will be written in more than one session, then consider enabling persistent free space in the file. To enable persistent free space, specify the is_persist argument of the H5P.set_file_space_strategy function as true.
If you do not or cannot use a paged aggregation strategy, then, instead of setting the page size and the chunk cache size, consider creating the file with a large metadata block. Choose a metadata block size that allows all the internal metadata of the file to fit in a single block. A size of 8 MiB works well for many files. You can set the size of the metadata block using the size argument of the H5P.set_meta_block_size function. For example:
size = 8*2^20;
H5P.set_meta_block_size(faplID,size)
Create Cloud-Optimized HDF5 File
This example shows how to create a cloud-optimized HDF5 file by using metadata consolidation techniques. You first define and configure property lists, consolidating internal file metadata by setting a paged aggregation strategy, page size, chunk cache size, and user block size, and then create the file and a dataset.
First, prepare by defining property lists, creating a simple dataspace with the size of each dimension unlimited, and setting the chunk size.
fcplID = H5P.create("H5P_FILE_CREATE");
faplID = H5P.create("H5P_FILE_ACCESS");
H5P.set_libver_bounds(faplID,"H5F_LIBVER_LATEST","H5F_LIBVER_LATEST")
typeID = H5T.copy("H5T_NATIVE_DOUBLE");
unlimited = H5ML.get_constant_value("H5S_UNLIMITED");
dims = [512 1024];
h5_dims = fliplr(dims);
h5_maxdims = [unlimited unlimited];
spaceID = H5S.create_simple(2,h5_dims,h5_maxdims);
dcplID = H5P.create("H5P_DATASET_CREATE");
daplID = H5P.create("H5P_DATASET_ACCESS");
chunkDims = [512 1024];
h5_chunkDims = fliplr(chunkDims);
H5P.set_chunk(dcplID,h5_chunkDims)
Choose a paged aggregation strategy for handling file space in the file, and enable persistent free space in the file.
strategy = "H5F_FSPACE_STRATEGY_PAGE";
is_persist = true;
threshold = 1; % Allow free-space manager to track all free space
H5P.set_file_space_strategy(fcplID,strategy,is_persist,threshold)
Set the page size in the file to 8 MiB.
fsp_size = 8*2^20;
H5P.set_file_space_page_size(fcplID,fsp_size)
Set the size of the chunk cache for the dataset to 2 MiB, and use 101 slots in the chunk cache for the dataset.
nbytes = 2*2^20;
nslots = 101;
w0 = 0.75; % Library default value
H5P.set_chunk_cache(daplID,nslots,nbytes,w0)
Set the size of the user block in the file to 2 KiB.
size = 2*2^10;
H5P.set_userblock(fcplID,size)
Create the file and then the dataset. Creating the file after setting the file creation properties ensures that the file space strategy, page size, and user block settings take effect. Pass daplID to H5D.create so that the chunk cache settings apply to the dataset.
fileID = H5F.create("myfile.h5","H5F_ACC_TRUNC",fcplID,faplID);
dsID = H5D.create(fileID,"DS1",typeID,spaceID,"H5P_DEFAULT",dcplID,daplID);
Close all identifiers.
H5D.close(dsID)
H5P.close(daplID)
H5P.close(dcplID)
H5S.close(spaceID)
H5T.close(typeID)
H5F.close(fileID)
H5P.close(faplID)
H5P.close(fcplID)
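Because this example reserves a 2 KiB user block, you can store user-specific information at the start of the file with ordinary file I/O after closing it. The HDF5 library skips the user block when reading the file. This is a minimal sketch; the message is a placeholder and must fit within the user block size.
% Write placeholder user-specific information into the 2 KiB user block
fid = fopen("myfile.h5","r+");
fwrite(fid,'Provenance: created by the CO HDF5 example');
fclose(fid);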
See Also
h5info | h5read | h5readatt | H5A.read | H5D.create | H5D.read | H5P.create | H5P.set_chunk_cache | H5P.set_file_space_page_size | H5P.set_file_space_strategy | H5P.set_meta_block_size | H5P.set_userblock