Set Up Datastore for Processing on Different Machines or Clusters

You can create and save a datastore on one platform, and then load and use it seamlessly on a different platform, by setting the 'AlternateFileSystemRoots' property of the datastore. Use this property when:

  • You create a datastore on a local machine, and need to access and process the data on another machine (possibly running a different operating system).

  • You process your datastore with parallel and distributed computing that involves different platforms, cloud machines, or cluster machines.

This example demonstrates the use of the 'AlternateFileSystemRoots' property for TabularTextDatastore. However, you can use the same syntax for any of these datastores: SpreadsheetDatastore, ImageDatastore, ParquetDatastore, FileDatastore, KeyValueDatastore, and TallDatastore. To use the 'AlternateFileSystemRoots' functionality for custom datastores, see matlab.io.datastore.DsFileSet and Develop Custom Datastore.

Save Datastore and Load on Different File System Platform

Create a datastore on one file system that loads and works seamlessly on a different machine (possibly running a different operating system). For example, create a datastore on a Windows® machine, save it, and then load it on a Linux® machine.

First, before you create and save the datastore, identify the root paths for your data on the different platforms. The root paths will differ based on the machine or file system. For instance, if you have data on your local machine and a copy of the data on a cluster, then get the root paths for accessing the data:

  • "Z:\DataSet" for your local Windows machine.

  • "/nfs-bldg001/DataSet" for your Linux cluster.

Then, associate these root paths by using the 'AlternateFileSystemRoots' parameter of the datastore.

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);

Examine the Files property of the datastore. In this instance, the Files property contains the location of your data as accessed by your Windows machine.

ds.Files
ans =

  5×1 cell array

    {'Z:\DataSet\datafile01.csv'}
    {'Z:\DataSet\datafile02.csv'}
    {'Z:\DataSet\datafile03.csv'}
    {'Z:\DataSet\datafile04.csv'}
    {'Z:\DataSet\datafile05.csv'}

Save the datastore.

save ds_saved_on_Windows.mat ds
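
Before moving the MAT-file to another machine, it can be useful to confirm that the saved datastore reloads cleanly on the machine that created it. The following is a minimal sketch of such a round trip, not part of the original example. It assumes a temporary folder stands in for "Z:\DataSet"; the sample table, the variable names dataDir, matFile, loaded, and filesResolve, and the temporary MAT-file name are all hypothetical.

```matlab
% Create a one-file sample dataset in a temporary folder (hypothetical data).
dataDir = tempname;
mkdir(dataDir);
T = table([1; 2],[3; 4],'VariableNames',{'MyVar1','MyVar2'});
writetable(T,fullfile(dataDir,'datafile01.csv'));

% Pair the local root with the cluster root used in this example.
% Assumption: the temporary folder plays the role of "Z:\DataSet".
altRoots = [string(dataDir),"/nfs-bldg001/DataSet"];
ds = tabularTextDatastore(dataDir,'AlternateFileSystemRoots',altRoots);

% Save to a MAT-file outside the data folder, and then load it back.
matFile = [tempname '.mat'];
save(matFile,'ds');
loaded = load(matFile);

% Check that every file in the reloaded datastore still exists locally.
filesResolve = all(isfile(loaded.ds.Files));
disp(filesResolve)
```

If the check reports false on the creating machine, the problem is likely with the data location itself rather than with the alternate roots.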

Load the datastore on a Linux platform and examine the Files property. Because the root path 'Z:\DataSet' is not accessible on the Linux cluster, the datastore function automatically updates the root paths at load time, based on the values specified in the 'AlternateFileSystemRoots' parameter. The Files property of the datastore now contains the updated root paths for your data on the Linux cluster.

load ds_saved_on_Windows.mat
ds.Files
ans =

  5×1 cell array

    {'/nfs-bldg001/DataSet/datafile01.csv'}
    {'/nfs-bldg001/DataSet/datafile02.csv'}
    {'/nfs-bldg001/DataSet/datafile03.csv'}
    {'/nfs-bldg001/DataSet/datafile04.csv'}
    {'/nfs-bldg001/DataSet/datafile05.csv'}

You can now process and analyze this datastore on your Linux machine.
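
The effect of the root substitution shown in the two Files listings can be pictured as plain string remapping. The sketch below is only an illustration of that mapping under the root paths used in this example; it is not how the datastore performs the update internally, and the variable names (winRoot, linuxRoot, files, relPaths, remapped) are hypothetical.

```matlab
% Root paths from this example.
winRoot   = "Z:\DataSet";
linuxRoot = "/nfs-bldg001/DataSet";

% Two of the file locations as seen from the Windows machine.
files = ["Z:\DataSet\datafile01.csv"; "Z:\DataSet\datafile02.csv"];

% Keep the part of each path that follows the Windows root.
relPaths = extractAfter(files,strlength(winRoot));

% Switch the file separators and prepend the Linux root.
remapped = linuxRoot + replace(relPaths,"\","/");
disp(remapped)
```

In practice you do not need to write this mapping yourself, because the 'AlternateFileSystemRoots' setting handles it when the datastore is loaded.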

Process Datastore Using Parallel and Distributed Computing

To process your datastore with parallel and distributed computing that involves different platforms, cloud or cluster machines, you must predefine the 'AlternateFileSystemRoots' parameter. This example demonstrates how to create a datastore on your local machine, analyze a small portion of the data, and then use Parallel Computing Toolbox™ and MATLAB® Parallel Server™ to scale up the analysis to the entire dataset.

Create a datastore and assign a value to the 'AlternateFileSystemRoots' property. To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For example, identify the root paths for data access from your machine and your cluster:

  • "Z:\DataSet" from your local Windows machine.

  • "/nfs-bldg001/DataSet" from the MATLAB Parallel Server Linux cluster.

Then, associate these root paths by using the 'AlternateFileSystemRoots' property.

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);

Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data, clean the data by removing any missing entries, and examine a plot of the variables.

tt = tall(partition(ds,100,1));
summary(tt);
% Analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
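
Operations on tall arrays are deferred until a result is requested, so a cleaning step such as rmmissing only runs when you call a function like gather. The following is a minimal sketch of that behavior on a small stand-in dataset, not part of the original example. The temporary folder, the sample tables, and the variable names dataDir and cleaned are hypothetical, and mapreducer(0) is used here on the assumption that you want the calculation to run in the local MATLAB session without a parallel pool.

```matlab
% Build a small sample dataset in a temporary folder (hypothetical data).
dataDir = tempname;
mkdir(dataDir);
T1 = table([1; 2; 3],[4; 5; 6],'VariableNames',{'MyVar1','MyVar2'});
T2 = table([7; NaN],[9; 10],'VariableNames',{'MyVar1','MyVar2'});
writetable(T1,fullfile(dataDir,'datafile01.csv'));
writetable(T2,fullfile(dataDir,'datafile02.csv'));

ds = tabularTextDatastore(dataDir);

% Run tall array calculations in the local MATLAB session, without a pool.
mapreducer(0);

% These two lines only queue the work; nothing is read yet.
tt = tall(ds);
tt = rmmissing(tt);

% gather triggers evaluation and returns an in-memory table
% that excludes the row containing the missing value.
cleaned = gather(tt);
disp(height(cleaned))
```

Calling gather is appropriate only when the result fits in memory, which is why the sketch applies it to a small cleaned table rather than to a full dataset.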

Scale up your analysis to the entire dataset by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then perform analysis on the entire dataset by using parallel and distributed computing capabilities.

parpool('MyMjsProfile')
tt = tall(ds);
summary(tt);
% Analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
