Develop Custom Tall Array Algorithms

Tall arrays are a powerful, intuitive way to work with large data sets using traditional MATLAB® syntax. However, since tall arrays operate on blocks of the data, each of which individually fits into memory, the traditional algorithms of most functions need to be updated to use a parallelized approach to support tall arrays. This topic shows you how to develop your own parallelized algorithms to operate on tall arrays.

Currently available approaches for applying custom functions to tall arrays are:

Regardless of which operation you choose, there are options, performance considerations, and common issues that apply to all approaches.

Reasons to Implement Custom Algorithms

Most common mathematical functions and MATLAB operations already support tall arrays. If the functionality is already supported, then writing your own algorithm might not be necessary.

Here are some reasons why you might want to implement a custom algorithm for tall arrays:

  • Implement Currently Unsupported Functions — If a particular function does not currently support tall arrays, then you can use the APIs outlined here to write a version of that function that supports tall arrays.

  • Leverage Existing Code — If you have existing code that performs some operations on in-memory data, then with only minor modifications you can make it compatible to operate on tall arrays. This approach avoids the need to convert the code to fit the subset of the MATLAB language that supports tall arrays.

  • Gain Performance — For example, you can rewrite a MATLAB function as a C++ MEX function, and then you can use the APIs outlined here to call the MEX function to operate on the data.

  • Use a Preferred External Library — For compatibility within your organization it is sometimes required to use a specific external library for certain calculations. You can use the APIs outlined here to reimplement a function with those external libraries.

Supported APIs

The supported APIs are intended for advanced use and do not include extensive input checking. Expect to spend some time testing that the supplemental functions you implement satisfy all of the requirements and perform the calculations you expect. Currently supported APIs for authoring tall array algorithms are listed here.

Package Function NameDescription
matlab.tall.transformApply a specified function to each block of one or more tall arrays.
matlab.tall.reduceApply a specified function to each block of one or more tall arrays. Then feed the output of that function into a second reduction function.
matlab.tall.movingWindowApply moving window function to blocks of data.
matlab.tall.blockMovingWindow

Apply moving window function and block reduction to padded blocks of data.

Background: Tall Array Blocks

When you create a tall array from a datastore, the underlying datastore facilitates the movement of data during a calculation. The data moves in discrete pieces called blocks, where each block is a set of consecutive rows that can fit in memory. For example, one block of a 2-D array (such as a table) is X(n:m,:). The size of each block is based on the value of the ReadSize property of the datastore, but the block is not always that exact size. For the purposes of developing tall array algorithms, a tall array is considered to be the vertical concatenation of many such blocks.

The blocks of a given array are chosen at run-time based on available memory, so they can be dynamic. Therefore, the blocks might not be exactly the same size between runs. If you have changes on your computer that affect the available memory, then that can impact the size of the blocks.

Although this page refers only to blocks and rows in a 2-D sense, these concepts extend to N-D tall arrays. The block size is only constrained in the first dimension, so the block includes all elements in other dimensions; for example, X(n:m,:,:,...). Also, rather than rows, N-D arrays have slices such as X(p,:,:,...).

Single-Step Transformation Operation

The matlab.tall.transform function applies a single function to each block of a tall array, so you can use it to apply a block-wise transformation, filtering, or reduction of the data. For example, you can remove rows with specific values, center and scale the data, or detect certain conditions and transform specific pieces of data. These figures show what happens to the blocks in an array when they are operated on by matlab.tall.transform.

Operation

Description

Examples

Transformation — The number of rows in each block remains the same, but the values change.

  • A = matlab.tall.transform(@sin, tX) calculates the sine of the elements in each block.

  • A = matlab.tall.transform(@(X) X.^2, tX) squares the elements in each block.

Filtering — The number of rows in each block are reduced, so the blocks in the new array might include rows originally present in other blocks.

  • A = matlab.tall.transform(@(X) topkrows(X,5), tX) extracts only the top 5 rows from each block, filtering out the other rows.

  • A = matlab.tall.transform(@sum, tX) calculates the sum of the elements in each block, which reduces each block to a scalar. The number of elements in A is equal to the number of blocks.

Transform Syntax

The generic syntax to apply a single-step transform is

[tA, tB, tC, ...] = matlab.tall.transform(fcn, tX, tY, tZ, ...)

Functional Requirements for fcn

The general functional signature of fcn is

[a, b, c, ...] = fcn(x, y, z, ...)
fcn must satisfy these requirements:

  1. Input Arguments — The inputs [x, y, z, ...] are blocks of data that fit in memory. The blocks are produced by extracting data from the respective tall array inputs [tX, tY, tZ, ...]. The inputs [x, y, z, ...] satisfy these properties:

    • All of [x, y, z, ...] have the same size in the first dimension after any allowed expansion.

    • The blocks of data in [x, y, z, ...] come from the same index in the tall dimension, assuming the tall array is nonsingleton in the tall dimension. For example, if tX and tY are nonsingleton in the tall dimension, then the first set of blocks might be x = tX(1:20000,:) and y = tY(1:20000,:).

    • If the first dimension of any of [tX, tY, tZ, ...] has a size of 1, then the corresponding block [x, y, z, ...] consists of all the data in that tall array.

  2. Output Arguments — The outputs [a, b, c, ...] are blocks that fit in memory, to be sent to the respective outputs [tA, tB, tC, ...]. The outputs [a, b, c, ...] satisfy these properties:

    • All of [a, b, c, ...] must have the same size in the first dimension.

    • All of [a, b, c, ...] are vertically concatenated with the respective results of previous calls to fcn.

    • All of [a, b, c, ...] are sent to the same index in the first dimension in their respective destination output arrays.

  3. Functional Rulesfcn must satisfy the functional rule:

    • F([inputs1; inputs2]) == [F(inputs1); F(inputs2)]: Applying the function to the concatenation of the inputs should be the same as applying the function to the inputs separately and then concatenating the results.

  4. Empty Inputs — Ensure that fcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data.

Two-Step Reduction Operation

matlab.tall.reduce applies two functions to a tall array, with the result of the first step being fed in as input to a final reduction step. The reduction function is applied repeatedly to the intermediate results until a single final block that fits in memory is obtained. In the MapReduce paradigm, this process is similar to a "single key" MapReduce operation, where the intermediate results all have the same key and are combined in the reduction step.

The first step is similar to matlab.tall.transform and has the same requirements. However, the reduction step always reduces the intermediate results down to a single block that fits in memory. These figures show what happens to the blocks in an array when they are operated on by matlab.tall.reduce.

Operation

Description

Examples

Transformation + Reduction — The number of rows in each block remains the same after the first step, and then the intermediate results are reduced to one block.

  • A = matlab.tall.reduce(@sin,@max,tX) calculates the sine of each block of values, and then it finds the overall maximum value during the reduction step.

  • A = matlab.tall.reduce(@(X) X.^2, @mean, tX) squares the elements in each block, and then it calculates the overall mean in the reduction step.

Filtering + Reduction — The number of rows in each block are reduced in the first step. Then the intermediate results are reduced to one block.

  • A = matlab.tall.reduce(@sum, @sum, tX) calculates the sum of the elements in each block, and then it finds the overall sum of elements in the reduction step.

  • A = matlab.tall.reduce(@(X) X(X>0), @mean, tX) filters out all negative values, and then it calculates the overall mean of the remaining values.

Reduce Syntax

The generic syntax to apply a two-step reduction is

[rA, rB, rC, ...] = matlab.tall.reduce(fcn, reducefcn, tX, tY, tZ, ...)

The functional signature of fcn is

[a, b, c, ...] = fcn(x, y, z, ...)

The functional signature of reducefcn is

[rA, rB, rC, ...] = reducefcn(a, b, c, ...)

That is, the input tall arrays [tX, tY, tZ, ...] are broken into blocks [x, y, z, ...] that are inputs to fcn. Then, fcn returns outputs [a, b, c, ...] that are inputs to reducefcn. Finally, reducefcn returns the final results [rA, rB, rC] that are returned by matlab.tall.reduce.

Functional Requirements for reducefcn

The requirements for fcn are the same as those that were outlined in Functional Requirements for fcn. However, the requirements for reducefcn are different.

The general functional signature of reducefcn is

[rA, rB, rC, ...] = reducefcn(a, b, c, ...)
reducefcn must satisfy these requirements:

  1. Input Arguments — The inputs [a, b, c, ...] are blocks that fit in memory. The blocks of data are either outputs returned by fcn, or a partially reduced output from reducefcn that is being operated on again for further reduction. The inputs [a, b, c, ...] satisfy these properties:

    • The inputs [a, b, c, ...] have the same size in the first dimension.

    • For a given index in the first dimension, every row of the blocks of data [a, b, c, ...] either originates from the input, or originates from the same previous call to reducefcn.

    • For a given index in the first dimension, every row of the inputs [a, b, c, ...] for that index originates from the same index in the first dimension.

  2. Output Arguments — All outputs [rA, rB, rC, ...] must have the same size in the first dimension. Additionally, they must be vertically concatenable with the respective inputs [a, b, c, ...] to allow for repeated reductions when necessary.

  3. Functional Rulesreducefcn must satisfy these functional rules (up to roundoff error):

    • F(input) == F(F(input)): Applying the function repeatedly to the same inputs should not change the result.

    • F([input1; input2]) == F([input2; input1]): The result should not depend on the order of concatenation.

    • F([input1; input2]) == F([F(input1); F(input2)]): Applying the function once to the concatenation of some intermediate results should be the same as applying it separately, concatenating, and applying it again.

  4. Empty Inputs — Ensure that reducefcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data. For this call, all input blocks are empty arrays of the correct type and size in dimensions beyond the first.

Sliding-Window Operations

The matlab.tall.movingWindow and matlab.tall.blockMovingWindow functions apply a function to windows of data within a tall array. While matlab.tall.transform and matlab.tall.reduce operate on entire blocks of data at a time, the moving-window functions operate on windows of data. The windows can span between the blocks of data being read from disk.

These figures show what happens to the blocks in an array when they are operated on by matlab.tall.movingWindow or matlab.tall.blockMovingWindow.

OperationDescriptionExamples

Windowed Transformation — The number of rows in each block remains the same, but the values change. The output contains the results of operations performed on both incomplete and complete windows of data.

Both matlab.tall.movingWindow and matlab.tall.blockMovingWindow transform data when 'EndPoints' is the default value of 'shrink', or when a fill value is specified. Both values ensure the output is the same size in the first dimension as the input.

  • A = matlab.tall.movingWindow(@mean, 100, tX) calculates a moving mean using a window size of 100.

Windowed Filtering — Incomplete windows of data are discarded, so the output has fewer elements than the input. The output only contains the results of operations performed on complete windows of data.

Both matlab.tall.movingWindow and matlab.tall.blockMovingWindow remove incomplete windows of data when 'EndPoints' is 'discard'.

  • A = matlab.tall.movingWindow(@mean, 100, tX, 'EndPoints', 'discard') calculates a moving mean on complete windows of data, using a window size of 100.

You can use matlab.tall.movingWindow and matlab.tall.blockMovingWindow to apply windowed transformations or filters to data. For example, you can calculate a trailing average or a moving median, or you can apply several operations at once to the same window. The two functions differ in these ways:

  • matlab.tall.movingWindow applies fcn to all windows of data, regardless of whether the windows are complete. matlab.tall.blockMovingWindow applies windowfcn to incomplete windows of data, and applies blockfcn to complete windows of data.

  • matlab.tall.movingWindow operates on single windows of data at a time. matlab.tall.blockMovingWindow operates on entire blocks of data containing multiple complete windows, which reduces the number of function calls required in the calculation.

Moving Window Syntaxes

The syntax to apply a moving window operation to single windows of data is

[tA, tB, tC, ...] = matlab.tall.movingWindow(fcn, window, tX, tY, tZ, ...)

The functional signature of fcn must be

[a, b, c, ...] = fcn(x, y, z, ...)

Similarly, the syntax to apply a moving window operation to entire blocks of data is

[tA, tB, tC, ...] = matlab.tall.blockMovingWindow(windowfcn, blockfcn, window, tX, tY, tZ, ...)

The functional signatures of windowfcn and blockfcn must be

[a, b, c, ...] = windowfcn(info, x, y, z, ...)
[a, b, c, ...] = blockfcn(info, bX, bY, bZ, ...)

The info input is a structure that contains the fields Window and Stride. When you write the function. use these fields to pick out windows of data in each block.

For an outline of general rules that fcn, windowfcn, and blockfcn must follow, see Functional Requirements for fcn. Aside from the info input, fcn and windowfcn have the same requirements. However, the requirements for blockfcn are different since that function operates on entire blocks of data.

Functional Requirements for windowfcn

The general functional signature of windowfcn is

[a, b, c, ...] = windowfcn(info, x, y, ...)
The info input is a structure provided by matlab.tall.blockMovingWindow that includes these fields:

  • Stride — Specified step size between windows (default: 1). Set this value with the 'Stride' name-value pair.

  • Window — Specified window size. Set this value with the window input argument.

windowfcn must satisfy these requirements:

  1. Input Arguments — The inputs [x, y, z, ...] are blocks of data that fit in memory. The blocks are produced by extracting data from the respective tall array inputs [tX, tY, tZ, ...]. The inputs [x, y, z, ...] satisfy these properties:

    • All of the inputs [x, y, z, ...] have the same size in the first dimension after any allowed expansion.

    • The blocks of data in [x, y, z, ...] come from the same index in the tall dimension, assuming the tall array is nonsingleton in the tall dimension. For example, if tX and tY are nonsingleton in the tall dimension, then the first set of blocks might be x = tX(1:20000,:) and y = tY(1:20000,:).

    • When the first dimension of any of [tX, tY, tZ, ...] has a size of 1, the corresponding block [x, y, z, ...] consists of all the data in that tall array.

    • Applying windowfcn must result in a reduction of the input data to a scalar or a slice of an array of height 1.

      When the input is a matrix, N-D array, table, or timetable, applying windowfcn must result in a reduction of the input data in each of its columns or variables.

  2. Output Arguments — The outputs [a, b, c, ...] are blocks that fit in memory to be sent to the respective outputs [tA, tB, tC, ...]. The outputs [a, b, c, ...] satisfy these properties:

    • All of the outputs [a, b, c, ...] must have the same size in the first dimension.

    • All of the outputs [a, b, c, ...] are vertically concatenated with the respective results of previous calls to windowfcn.

    • All of the outputs [a, b, c, ...] are sent to the same index in the first dimension in their respective destination output arrays.

  3. Functional Ruleswindowfcn must satisfy this functional rule:

    • F([inputs1; inputs2]) == [F(inputs1); F(inputs2)]: Applying the function to the concatenation of the inputs should be the same as applying the function to the inputs separately and then concatenating the results.

  4. Empty Inputs — Ensure that windowfcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data.

Functional Requirements for blockfcn

The general functional signature of blockfcn is

[a, b, c, ...] = blockfcn(info, bX, bY, bZ, ...)
The info input is a structure provided by matlab.tall.blockMovingWindow that includes these fields:

  • Stride — Specified step size between windows (default: 1). Set this value with the 'Stride' name-value pair.

  • Window — Specified window size. Set this value with the window input argument.

The blocks of data bX, bY, bZ, ... that matlab.tall.blockMovingWindow provides to blockfcn have these properties:

  • The blocks contain only full-sized windows. blockfcn does not have to define a behavior for incomplete windows of data.

  • The first window of data starts at the first element of the block. The last element of the last window is the last element of the block.

blockfcn must satisfy these requirements:

  1. Input Arguments — The inputs [bX, bY, bZ, ...] are blocks of data that fit in memory. The blocks are produced by extracting data from the respective tall array inputs [tX, tY, tZ, ...]. The inputs [bX, bY, bZ, ...] satisfy these properties:

    • All of the inputs [bX, bY, bZ, ...] have the same size in the first dimension after any allowed expansion.

    • The blocks of data in [bX, bY, bZ, ...] come from the same index in the tall dimension, assuming the tall array is nonsingleton in the tall dimension. For example, if tX and tY are nonsingleton in the tall dimension, then the first set of blocks might be bX = tX(1:20000,:) and bY = tY(1:20000,:).

    • If the first dimension of any of the data inputs [tX, tY, tZ, ...] has a size of 1, then the corresponding block [bX, bY, bZ, ...] consists of all the data in that tall array.

    • Applying blockfcn must result in a reduction of the input data such that the result has height equal to the number of windows in the block. You can use info.Window and info.Stride to determine the number of windows in a block.

      If the input is a matrix, N-D array, table, or timetable, then applying blockfcn must result in a reduction of the input data in each of its columns or variables.

  2. Output Arguments — The outputs [a, b, c, ...] are blocks that fit in memory, to be sent to the respective outputs [tA, tB, tC, ...]. The outputs [a, b, c, ...] satisfy these properties:

    • All of the outputs [a, b, c, ...] must have the same size in the first dimension.

    • All of the outputs [a, b, c, ...] are vertically concatenated with the respective results of previous calls to blockfcn.

    • All of the outputs [a, b, c, ...] are sent to the same index in the first dimension in their respective destination output arrays.

  3. Functional Rulesblockfcn must satisfy this functional rule:

    • F([inputs1; inputs2]) == [F(inputs1); F(inputs2)]: Applying the function to the concatenation of the inputs should be the same as applying the function to the inputs separately and then concatenating the results.

  4. Empty Inputs — Ensure that blockfcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data.

Control Output Data Type

If the final output from any of the Supported APIs has a different data type from the input, then you must specify the 'OutputsLike' name-value pair to provide one or more prototype arrays that have the same data type and attributes as the corresponding outputs. The value of 'OutputsLike' is always a cell array, with each cell containing a prototype array for the corresponding output argument.

For example, this call to matlab.tall.transform accepts one tall array tX as an input and returns two outputs with different types specified by the prototype arrays protoA and protoB. Output A has the same data type and attributes as protoA, and likewise for B and protoB.

C = {protoA protoB};
[A, B] = matlab.tall.transform(fcn, tX, 'OutputsLike', C)

A common way to supply the prototype arrays is to call fcn with trivial inputs of the proper data type, since the outputs returned by fcn have the correct data type. In this example, the transform function accepts a tall double, but returns a tall table. A prototype array is generated by calling fcn(0) and the prototype is specified as the value of 'OutputsLike'.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
tt = tall(ds);
tX = tt.ArrDelay;

fcn = @(x) table(x,'VariableNames',{'MyVar'});
proto_A = fcn(0);
A = matlab.tall.transform(fcn,tX,'OutputsLike',{proto_A});

Coding and Performance Tips

  • Put all analytics in a single function that you call to operate directly on the data, instead of using unnecessary nested functions.

  • Experiment using a small subset of the data. Profile your code to find and fix bottlenecks before scaling up to the entire data set, where bottlenecks can be greatly amplified.

  • Pay attention to the orientation of your data, since some functions return the outputs in different shapes depending on the input data. For example, unique can return either a row vector or a column vector depending on the orientation of the input data.

  • Blocks are dynamically generated at run-time based on available computer memory. Make sure that any specified reduction function obeys the function rule F([input1; input2]) == F([F(input1); F(input2)]). If this rule is not obeyed, then the results can differ significantly between trials.

  • Blocks can have any size in the first dimension, including 0 or 1. Size 0 or 1 can occur in intermediate calculations as a result of filtering or reduction operations. Make sure your function does the correct thing for both of these cases. One sign that the function does not handle these cases properly is when you receive an "Output is different size" error message.

See Also

| | |

Related Topics