datastore

Create datastore for large collections of data

Syntax

``ds = datastore(location)``
``ds = datastore(location,Name,Value)``

Description

`ds = datastore(location)` creates a datastore from the collection of data specified by `location`. A datastore is a repository for collections of data that are too large to fit in memory. After creating `ds`, you can read and process the data.

`ds = datastore(location,Name,Value)` specifies additional parameters for `ds` using one or more name-value pair arguments. For example, you can create a datastore for image files by specifying `'Type','image'`.

Examples

Create a datastore associated with the sample file `airlinesmall.csv`. This file contains airline data from the years 1987 through 2008.

To manage the import of missing data in numeric columns, use the `'TreatAsMissing'` name-value pair argument. In this example, specifying the value `'NA'` for `'TreatAsMissing'` replaces every instance of `'NA'` with `NaN` in the imported data. Here, `NaN` is the value specified in the `'MissingValue'` property of the datastore.

```
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing','NA')
```

```
ds = 
  TabularTextDatastore with properties:

                      Files: {
                             ' .../devel/bat/Bdoc19b/build/matlab/toolbox/matlab/demos/airlinesmall.csv'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
      PreserveVariableNames: false
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: 'NA'
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
```

`datastore` creates a `TabularTextDatastore`.
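Once the datastore exists, the data can be processed one block at a time instead of loading the whole file. The following is a minimal sketch of that pattern. It assumes that the sample file contains a numeric variable named `ArrDelay` (the display above elides most variable names), and the block size of 5000 rows is an arbitrary choice for illustration.

```matlab
% Sketch: compute a mean over the file without reading it all at once.
% Assumption: airlinesmall.csv has a variable named ArrDelay.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay'};   % import only this variable
ds.ReadSize = 5000;                        % rows returned by each call to read

total = 0;
count = 0;
while hasdata(ds)
    T = read(ds);                          % table with up to ReadSize rows
    valid = ~isnan(T.ArrDelay);            % skip values imported as NaN
    total = total + sum(T.ArrDelay(valid));
    count = count + nnz(valid);
end
meanArrDelay = total/count;
```

Restricting `SelectedVariableNames` before reading keeps each block small, because only the selected variables are imported.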

Create a datastore containing all `.tif` files in the MATLAB® path and its subfolders.

```
ds = datastore(fullfile(matlabroot, 'toolbox', 'matlab'),...
    'IncludeSubfolders', true,'FileExtensions', '.tif','Type', 'image')
```

```
ds = 
  ImageDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\example.tif';
              ' ...\matlab\toolbox\matlab\imagesci\corn.tif'
              }
    ReadSize: 1
      Labels: {}
     ReadFcn: @readDatastoreImage
```

Input Arguments

Files or folders included in the datastore, specified as a path or a `DsFileSet` object.

• path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.

• Local files or folders — Specify `location` as a local path to files or folders. If the files are not in the current folder, then the local path must be a full or relative path. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore includes all matching files or all files in the matching folders.

• Remote files or folders — Specify `location` to be the full paths of the files or folders as a uniform resource locator (URL) of the form `hdfs:///path_to_file`. For more information, see Work with Remote Data.

• `DsFileSet` object — You also can specify `location` as a `DsFileSet` object. For more information, see `matlab.io.datastore.DsFileSet`.

When `location` represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the `FileExtensions` property.

For `KeyValueDatastore`, the files must be MAT-files or Sequence files generated by the `mapreduce` function. MAT-files must be in a local file system or in a network file system. Sequence files can be in a local, network, or HDFS™ file system. For `DatabaseDatastore`, the location argument need not be files. For more information, see `DatabaseDatastore`.

Example: `'file1.csv'`

Example: `'../dir/data/file1.jpg'`

Example: `{'C:\dir\data\file1.xls','C:\dir\data\file2.xlsx'}`

Example: `'C:\dir\data\*.mat'`

Example: `'hdfs:///data/file1.txt'`
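As an illustration of the wildcard form of `location`, the sketch below writes three small files into a scratch folder and builds a datastore from only the `.csv` files. The folder, file names, and table contents are hypothetical and exist only for this example.

```matlab
% Sketch: select files with a wildcard. All names here are made up.
folder = tempname;                         % hypothetical scratch folder
mkdir(folder);
T1 = table((1:3)',(4:6)','VariableNames',{'A','B'});
T2 = table((7:9)',(10:12)','VariableNames',{'A','B'});
writetable(T1,fullfile(folder,'part1.csv'));
writetable(T2,fullfile(folder,'part2.csv'));
writetable(T1,fullfile(folder,'other.txt'));   % not matched by *.csv

ds = datastore(fullfile(folder,'*.csv'));  % only the two .csv files
nFiles = numel(ds.Files);
allRows = readall(ds);                     % rows of both files in one table
```

Here `readall` is reasonable only because the files are tiny. For data that does not fit in memory, read in blocks with `read` instead.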

Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'FileExtensions',{'.jpg','.tif'}` includes all files with a `.jpg` or `.tif` extension for an `ImageDatastore` object.

Type of datastore, specified as the comma-separated pair consisting of `'Type'` and one of the following:

| Value of `'Type'` | Description |
| --- | --- |
| `'tabulartext'` | Text files containing tabular data. The encoding of the data must be ASCII or UTF-8. |
| `'image'` | Image files in a format such as JPEG or PNG. Acceptable files include `imformats` formats. |
| `'spreadsheet'` | Spreadsheet files containing one or more sheets. |
| `'keyvalue'` | Key-value pair data contained in MAT-files or Sequence files with data generated by `mapreduce`. |
| `'file'` | Custom format files, which require a specified read function to read the data. For more information, see `FileDatastore`. |
| `'tall'` | MAT-files or Sequence files produced by the `write` function of the `tall` data type. For more information, see `TallDatastore`. |
| `'parquet'` | Parquet files containing column-oriented data. For more information, see `ParquetDatastore`. |
| `'database'` | Data stored in a database. Requires Database Toolbox™. Requires specification of an additional input argument when using the `'Type'` parameter. For more information, see `DatabaseDatastore`. |

• If there are multiple types that support the format of the files, then use the `'Type'` argument to specify a datastore type.

• If you do not specify a value for `'Type'`, then `datastore` automatically determines the appropriate type of datastore to create based on the extensions of the files.

Data Types: `char` | `string`
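The difference between automatic detection and an explicit `'Type'` can be sketched with a single CSV file, which can be read either as tabular text or as a custom-format file. The file name and contents below are hypothetical, and using `fileread` as the read function is only an illustrative choice.

```matlab
% Sketch: the same file opened as two different datastore types.
file = [tempname '.csv'];                  % hypothetical scratch file
writetable(table((1:3)',(4:6)','VariableNames',{'A','B'}),file);

dsAuto = datastore(file);                  % type chosen from the .csv extension
dsFile = datastore(file,'Type','file', ...
    'ReadFcn',@fileread);                  % custom read function is required
rawText = read(dsFile);                    % whole file as a character vector

autoClass = class(dsAuto);
fileClass = class(dsFile);
```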

Include subfolders within a folder, specified as the comma-separated pair consisting of `'IncludeSubfolders'` and `true` (1) or `false` (0). Specify `true` to include all files and subfolders within each folder or `false` to include only the files within each folder.

If you do not specify `'IncludeSubfolders'`, then the default value is `false`.

The `'IncludeSubfolders'` name-value pair is only valid when creating these objects:

• `TabularTextDatastore`

• `ImageDatastore`

• `SpreadsheetDatastore`

• `FileDatastore`

• `KeyValueDatastore`

• `ParquetDatastore`

Example: `'IncludeSubfolders',true`

Data Types: `logical` | `double`
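The effect of `'IncludeSubfolders'` can be seen by counting files with and without it. This is a small sketch that uses a hypothetical scratch folder with one file at the top level and one file in a subfolder.

```matlab
% Sketch: compare the file lists with and without subfolders.
folder = tempname;                         % hypothetical scratch folder
mkdir(fullfile(folder,'sub'));             % also creates the parent folder
T = table((1:3)',(4:6)','VariableNames',{'A','B'});
writetable(T,fullfile(folder,'top.csv'));
writetable(T,fullfile(folder,'sub','nested.csv'));

dsTop = datastore(folder,'Type','tabulartext');           % top level only
dsAll = datastore(folder,'Type','tabulartext', ...
    'IncludeSubfolders',true);                            % top level and sub
nTop = numel(dsTop.Files);
nAll = numel(dsAll.Files);
```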

Extensions of files, specified as the comma-separated pair consisting of `'FileExtensions'` and a character vector, cell array of character vectors, string scalar, or string array. When specifying `'FileExtensions'`, also specify `'Type'`. You can use the empty quotes `''` to represent files without extensions.

If `'FileExtensions'` is not specified, then `datastore` automatically includes all supported file extensions depending on the datastore type. If you want to include unsupported extensions, then specify each extension you want to include individually.

• For `TabularTextDatastore` objects, supported extensions include `.txt`, `.csv`, `.dat`, `.dlm`, `.asc`, `.text`, and no extension.

• For `ImageDatastore` objects, supported extensions include all `imformats` extensions.

• For `SpreadsheetDatastore` objects, supported extensions include `.xls`, `.xlsx`, `.xlsm`, `.xltx`, and `.xltm`.

• For `TallDatastore` objects, supported extensions include `.mat` and `.seq`.

• For `ParquetDatastore` objects, supported extensions include `.parquet` and `.parq`.

The `'FileExtensions'` name-value pair is only valid when creating these objects:

• `TabularTextDatastore`

• `ImageDatastore`

• `SpreadsheetDatastore`

• `FileDatastore`

• `KeyValueDatastore`

• `ParquetDatastore`

Example: `'FileExtensions','.jpg'`

Example: `'FileExtensions',{'.txt','.text'}`

Data Types: `char` | `cell` | `string`
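To illustrate how `'FileExtensions'` brings in an extension that is not included by default, the sketch below stores the same table as a `.csv` file and as a `.log` file. The folder and file names are hypothetical, and `.log` is simply an example of an extension outside the supported list for `TabularTextDatastore`.

```matlab
% Sketch: include an unsupported extension by naming it explicitly.
folder = tempname;                         % hypothetical scratch folder
mkdir(folder);
T = table((1:3)',(4:6)','VariableNames',{'A','B'});
writetable(T,fullfile(folder,'data.csv'));
writetable(T,fullfile(folder,'data.log'),'FileType','text');

dsLog  = datastore(folder,'Type','tabulartext', ...
    'FileExtensions','.log');              % only data.log
dsBoth = datastore(folder,'Type','tabulartext', ...
    'FileExtensions',{'.csv','.log'});     % both files
nLog  = numel(dsLog.Files);
nBoth = numel(dsBoth.Files);
```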

Alternate file system root paths, specified as the comma-separated pair consisting of `'AlternateFileSystemRoots'` and a string vector or a cell array. Use `'AlternateFileSystemRoots'` when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform, you must use `'AlternateFileSystemRoots'` to associate the root paths.

• To associate a set of root paths that are equivalent to one another, specify `'AlternateFileSystemRoots'` as a string vector. For example,

`["Z:\datasets","/mynetwork/datasets"]`

• To associate multiple sets of root paths that are equivalent for the datastore, specify `'AlternateFileSystemRoots'` as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

• Specify `'AlternateFileSystemRoots'` as a cell array of string vectors.

```
{["Z:\datasets", "/mynetwork/datasets"];...
 ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}
```

• Alternatively, specify `'AlternateFileSystemRoots'` as a cell array of cell arrays of character vectors.

```
{{'Z:\datasets','/mynetwork/datasets'};...
 {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}
```

The value of `'AlternateFileSystemRoots'` must satisfy these conditions:

• Contains one or more rows, where each row specifies a set of equivalent root paths.

• Each row specifies multiple root paths and each root path must contain at least two characters.

• Root paths are unique and are not subfolders of one another.

• Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: `["Z:\datasets","/mynetwork/datasets"]`

Data Types: `string` | `cell`

Output data type of text variables, specified as the comma-separated pair consisting of `'TextType'` and either `'char'` or `'string'`. If the output table from the `read`, `readall`, or `preview` functions contains text variables, then `'TextType'` specifies the data type of those variables for `TabularTextDatastore` and `SpreadsheetDatastore` objects only. If `'TextType'` is `'char'`, then the output is a cell array of character vectors. If `'TextType'` is `'string'`, then the output has type `string`.

Data Types: `char` | `string`
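A short sketch of how `'TextType'` changes the type of a text variable in the table returned by `read`. The file and its variable names (`Airport`, `Delay`) are hypothetical.

```matlab
% Sketch: the same text column imported as char and as string.
file = [tempname '.csv'];                  % hypothetical scratch file
T = table(["LAX";"JFK"],[12;7],'VariableNames',{'Airport','Delay'});
writetable(T,file);

dsChar = datastore(file,'TextType','char');
dsStr  = datastore(file,'TextType','string');
tChar = read(dsChar);                      % Airport: cell array of char vectors
tStr  = read(dsStr);                       % Airport: string array
```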

Type for imported date and time data, specified as the comma-separated pair consisting of `'DatetimeType'` and one of these values: `'datetime'` or `'text'`. The `'DatetimeType'` argument only applies when creating a `TabularTextDatastore` object.

| Value | Type for Imported Date and Time Data |
| --- | --- |
| `'datetime'` | MATLAB `datetime` data type. For more information, see `datetime`. |
| `'text'` | The type depends on the value specified in the `'TextType'` parameter. If `'TextType'` is `'char'`, then the `datastore` returns dates as a cell array of character vectors. If `'TextType'` is `'string'`, then the `datastore` returns dates as an array of strings. |

Example: `'DatetimeType','datetime'`

Data Types: `char` | `string`
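The interaction between `'DatetimeType'` and `'TextType'` can be sketched as follows. The file contents are hypothetical, and the sketch assumes that dates written as `yyyy-MM-dd` are recognized as date and time data during import.

```matlab
% Sketch: import a date column as datetime values or as text.
% Assumption: the yyyy-MM-dd format is detected as date and time data.
file = [tempname '.csv'];                  % hypothetical scratch file
fid = fopen(file,'w');
fprintf(fid,'Date,Delay\n2008-01-03,12\n2008-01-04,7\n');
fclose(fid);

dsDate = datastore(file,'DatetimeType','datetime');
dsText = datastore(file,'DatetimeType','text','TextType','string');
tDate = read(dsDate);                      % Date: datetime array
tText = read(dsText);                      % Date: string array
```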

Output data type of duration data from text files, specified as the comma-separated pair consisting of `'DurationType'` and either `'duration'` or `'text'`.

| Value | Type for Imported Duration Data |
| --- | --- |
| `'duration'` | MATLAB `duration` data type. For more information, see `duration`. |
| `'text'` | The type depends on the value specified in the `'TextType'` parameter. If `'TextType'` is `'char'`, then the importing function returns duration data as a cell array of character vectors. If `'TextType'` is `'string'`, then the importing function returns duration data as an array of strings. |

Data Types: `char` | `string`

Flag to preserve variable names, specified as the comma-separated pair consisting of `'PreserveVariableNames'` and either `true` or `false`.

• `true` — Preserve variable names that are not valid MATLAB identifiers such as variable names that include spaces and non-ASCII characters.

• `false` — Convert invalid variable names (as determined by the `isvarname` function) to valid MATLAB identifiers.

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the `isvarname` function). To preserve these variable names and row names, set `PreserveVariableNames` to `true`.
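The two settings can be compared on a file whose header contains names that are not valid MATLAB identifiers. This is a sketch with a hypothetical file and header. It does not rely on the exact form of the converted names, only on the fact that they are valid identifiers.

```matlab
% Sketch: header names with spaces and parentheses.
file = [tempname '.csv'];                  % hypothetical scratch file
fid = fopen(file,'w');
fprintf(fid,'Flight Number,Delay (min)\n101,5\n202,12\n');
fclose(fid);

dsKeep = datastore(file,'PreserveVariableNames',true);
dsFix  = datastore(file,'PreserveVariableNames',false);
keptNames  = dsKeep.VariableNames;         % names exactly as in the file
fixedNames = dsFix.VariableNames;          % names converted to identifiers
```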

In addition to these name-value pairs, you also can specify any of the properties of the datastore object being created (see the datastore types listed under Output Arguments) as name-value pairs, except for the `Files` property.

Output Arguments

Datastore for a collection of data, returned as one of these objects: `TabularTextDatastore`, `ImageDatastore`, `SpreadsheetDatastore`, `KeyValueDatastore`, `FileDatastore`, `TallDatastore`, `ParquetDatastore`, or `DatabaseDatastore`. The type of the datastore depends on the type of files or the `location` argument. For more information, click the datastore name in the following table:

| Type of Files | Output |
| --- | --- |
| Text files | `TabularTextDatastore` |
| Image files | `ImageDatastore` |
| Spreadsheet files | `SpreadsheetDatastore` |
| MAT-files or Sequence files produced by `mapreduce` | `KeyValueDatastore` |
| Custom format files | `FileDatastore` |
| MAT-files or Sequence files produced by the `write` function of the `tall` data type | `TallDatastore` |
| Parquet files | `ParquetDatastore` |
| Database | `DatabaseDatastore` |

For each of these datastore types, the `Files` property is a cell array of character vectors. Each character vector is an absolute path to a file resolved by the `location` argument.