Building tall table from tall arrays generates error

clear
dataFile = 'data.csv';
ds = tabularTextDatastore(dataFile, FileExtensions='.csv');
ds.ReadVariableNames = true;
ds.Delimiter = ',';
ds.SelectedVariableNames = ["hash", "count"];
ds.SelectedFormats = {'%s', '%f'};
data = tall(ds);
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to the parallel pool (number of workers: 2).
[g, THash] = findgroups(data.hash);
TCount = splitapply(@(x) {x}, data.count, g);
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 1: 0% complete
- Pass 1 of 1: 100% complete
- Pass 1 of 1: Completed in 1.9 sec
Evaluation completed in 2.8 sec
count = gather(TCount);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 3: 0% complete
- Pass 1 of 3: 100% complete
- Pass 1 of 3: Completed in 0.54 sec
- Pass 2 of 3: 0% complete
- Pass 2 of 3: 100% complete
- Pass 2 of 3: Completed in 0.46 sec
- Pass 3 of 3: 0% complete
- Pass 3 of 3: 100% complete
- Pass 3 of 3: Completed in 0.58 sec
Evaluation completed in 2.3 sec
T1 = table(hash, count);
%% This is the intended code but doesn't work
TT = table(THash,TCount);
Error using tall/table
Incompatible non-scalar tall array arguments. Each of the tall arrays must be the same size in the first dimension, must be derived from a single tall array, and must not have been indexed
differently in the first dimension (indexing operations include functions such as VERTCAT, SPLITAPPLY, SORT, CELL2MAT, SYNCHRONIZE, RETIME and so on).
write(fullfile(pwd,'data'),TT,FileType="parquet");

Answers (1)

Oguz Kaan Hancioglu on 15 Mar 2023
Your code doesn't work because "gather(TCount)" returns a cell array in which each element is itself a double array, so you are trying to write a double array into a single cell. Instead, you can store the length of each array in the cell. I hope this solves your problem.
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
count = gather(TCount);
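% Size of each gathered array, e.g. [n 1]
cellsz = cellfun(@size,count,'UniformOutput',false);
% Keep only the first dimension (the length) of each group's array
newCount = cellfun(@(x) x(1),cellsz,'UniformOutput',false);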
T1 = table(hash, newCount);
  1 Comment
Harry Cho on 15 Mar 2023
Thank you for the reply. Unfortunately, I need to collect a cell array in which each cell holds a double array of a different length. My question is why this works for the in-memory table T1 but not for the tall table TT.
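For reference, here is a small in-memory sketch of the grouping being discussed. The sample values are made up for illustration and do not come from the original data.csv; only findgroups, splitapply, and table are taken from the thread. It shows the result the question is after: one row per hash, with a cell holding a double vector whose length varies by group.

```matlab
% Made-up sample data (hypothetical values, not from data.csv)
hash  = ["a"; "b"; "a"; "c"; "b"; "a"];
count = [1; 2; 3; 4; 5; 6];

% Group rows by hash; uhash lists the unique hashes in sorted order
[g, uhash] = findgroups(hash);

% Collect each group's counts into one cell per group
grouped = splitapply(@(x) {x}, count, g);

% In-memory table: one row per hash, ragged double vectors in cells
T = table(uhash, grouped);

% Number of counts collected for each hash
lens = cellfun(@numel, grouped);
```

In memory this works because uhash and grouped are ordinary arrays with the same number of rows. In the tall version, the error message quoted in the question lists SPLITAPPLY among the operations that make tall arrays incompatible, which appears to be why table(THash,TCount) is rejected even though the gathered results combine without trouble.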

Release

R2022b
