Performance considerations for using strings or cellstr in table?

Hello,
I'm looking for some information related to performance (speed, memory consumption) of using a cell array (cellstr) or string array when storing "strings" in a table variable.
More concretely, are there costs in terms of lost performance using string arrays over cellstr?
The strings aren't so long, perhaps between five to fifty characters typically.
Note: I did a little test related to memory usage, and string arrays seem to consume less memory than cellstr. In my test case, a padded char array used even less:
% Make some random char arrays of different lengths
D_cellstr = arrayfun(@(unused) char(randi([65 90], 1, randi([10 30]))), 1:100, 'uni', 0)';
% Make into char array, padded with spaces
D_char_array = char(D_cellstr{:});
% Put some non-char values in there, making it not a cellstr
D_cell_array = D_cellstr; D_cell_array([10:20:end]) = { NaN };
% Make some tables
TS_cellstr = table(D_cellstr(1:end)); % The (1:end) indexing just forces the default 'Var1' variable name
TS_char_array = table(D_char_array(:,:));
TS_cell_array = table(D_cell_array(1:end));
TS_string = table(string(D_cellstr));
TS_string2 = table(string(D_cell_array(1:end)));
% Results
whos TS_* D_*
% Name            Size      Bytes  Class  Attributes
%
% D_cell_array    100x1     14260  cell
% D_cellstr       100x1     14394  cell
% D_char_array    100x30     6000  char
% TS_cell_array   100x1     15244  table
% TS_cellstr      100x1     15378  table
% TS_char_array   100x1      6984  table
% TS_string       100x1      9712  table
% TS_string2      100x1      9338  table
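Since the question also asks about speed, here is a small hedged sketch (not from the original post; the data size and the choice of an exact-match search are my own) timing one common operation on the two representations:

```matlab
% Hypothetical timing sketch: compare an exact-match search on the same
% text held as a cellstr versus as a string array.
D_cellstr = arrayfun(@(k) char(randi([65 90], 1, randi([10 30]))), (1:1e5)', 'uni', 0);
D_string  = string(D_cellstr);
target = D_cellstr{1};

tic; idx_cell = strcmp(D_cellstr, target); t_cell = toc;   % cellstr path
tic; idx_str  = (D_string == target);      t_str  = toc;   % string path
fprintf('cellstr: %.4f s   string: %.4f s\n', t_cell, t_str)
```

Results will vary by release and machine, so it is worth re-running this on data shaped like your own.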
  10 Comments
Bjorn Gustavsson on 12 Aug 2021
My only experience with a similar problem was in a project where we updated the header format on a yearly basis (until we wised up and wrote more sensible specifications), and I had to adapt the metadata extraction to handle all the formats - but there I was just along for the ride. Here you seem to be put in a position where you have to play detective about what to do. This seems like a task that's "better done by someone else" - for example, have a couple of summer interns re-edit and proofread the files into a single proper data format; it is only 15000x70...
Good luck.
dpb on 12 Aug 2021
Edited: dpb on 12 Aug 2021
There's not a real hard and fast way to decide when tables are too large -- but there is certainly an overhead to be paid for the convenience of the object.
Again, it mostly depends on just what is going to be done with the resulting file. While it's annoying, time-consuming and tedious to bring in all the disparate files, that should be a one-time-only exercise -- once done, you can save the new format and no longer have to deal with the old. Hence, I'm not sure it's worth spending too much time trying to optimize the performance of the input side of things.
But, if there is a need to process the resulting data by some set of these various variables like the type you mentioned above, then the builtin rowfun function with grouping variables is just the cat's meow for such.
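For anyone unfamiliar with rowfun, a minimal sketch of grouping (the table contents and the variable names Group/Value/MeanValue here are invented for illustration):

```matlab
% Minimal rowfun-with-grouping sketch: group-wise mean of Value by Group.
T = table(categorical({'a';'b';'a';'b';'a'}), [1;2;3;4;5], ...
          'VariableNames', {'Group','Value'});
G = rowfun(@mean, T, ...
           'InputVariables',      'Value', ...
           'GroupingVariables',   'Group', ...
           'OutputVariableNames', 'MeanValue');
disp(G)   % one row per group, with GroupCount and MeanValue
```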
You might be able to speed up the input with the import options object as noted above. If the files vary from one to another it may not help the speed, but you might still save on post-read corrections in code: spend the time on the import options instead, so the reads return consistent data types and handle missing columns, etc. That may be less time-consuming than writing fix-up code after a dumb read via xlsread.
For example, that way you can pre-determine that a column is to be interpreted as numeric, string, cellstr, etc., which will at least minimize the difficulty of cleaning up after inconsistent cell input where cells within the same column/variable are of different types.
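A hedged sketch of what this looks like in practice (the file name 'datafile.xlsx' and the column names ID, Comment, and Value are assumptions for illustration):

```matlab
% Sketch: pre-set column types via an import-options object so every
% file comes back with consistent variable types.
opts = detectImportOptions('datafile.xlsx');      % assumed file name
opts = setvartype(opts, {'ID','Comment'}, 'string');  % assumed text columns
opts = setvartype(opts, 'Value', 'double');           % assumed numeric column
opts.MissingRule = 'fill';        % fill missing cells instead of erroring
T = readtable('datafile.xlsx', opts);
```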


Answers (1)

Walter Roberson on 12 Aug 2021
nrow = 15000;
ncol = 70;
Chars = [' ', '0':'9', 'a':'z'];
NChars = length(Chars);
[R,C] = ndgrid(1:nrow, 1:ncol);
data_cell = arrayfun(@(r,c) Chars(randi(NChars,1,randi([5 50],1,1))), R, C, 'uniform', 0);
T1_cell = cell2table(data_cell);
T2_string = array2table(string(data_cell));
whos T1_cell T2_string
Name           Size        Bytes      Class  Attributes

T1_cell        15000x70    166984847  table
T2_string      15000x70    99927523   table
N = 10000;
rrow = randi(nrow, N, 1);
rcol = randi(ncol, N, 1);
tic; for K = 1 : N; this = T1_cell{rrow(K), rcol(K)}; end; r1_time = toc;
tic; for K = 1 : N; this = T2_string{rrow(K), rcol(K)}; end; r2_time = toc;
r1_time
r1_time = 1.0322
r2_time
r2_time = 0.7974
This tells us that, in this test, cellstr takes about 67% more storage than string (166984847 vs 99927523 bytes), and random element access through the table is roughly 30% slower (1.03 s vs 0.80 s).
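One caveat worth adding (my own observation, not part of the answer above): per-element brace indexing into a table is the slow path for both types. Extracting the variable once and indexing the raw array is typically far faster. A self-contained sketch:

```matlab
% Sketch: per-element table indexing vs indexing an extracted variable.
T = table(string(compose("row%d", (1:15000)')));  % one string variable (Var1)
rrow = randi(15000, 10000, 1);

tic; for K = 1:10000; this = T{rrow(K), 1}; end; toc   % through the table

v = T.Var1;                                            % extract once
tic; for K = 1:10000; this = v(rrow(K)); end; toc      % raw array indexing
```

On typical releases the second loop is orders of magnitude faster, since it skips the table subscripting machinery on every iteration.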
