Alternatives to using ismember() for string versus cell array of strings

Back when I was using R2009b, I got into the habit of using ~isempty(find(strcmp(thisstring,cellarrayofstrings))) in place of ismember(). In the case of testing a single string against an array of strings, this was about 50x as fast for my particular test arrays in R2009b.
Going back and revisiting some of the same optimization motivations in newer versions (particularly R2015b, though the students use newer versions), I find that it's still about 2x-4x as fast. As these sort of tests are often part of parsing input arguments to a function, these little bits of time just add to the overhead (sometimes as much as 20% of total execution time), and I'd like to pare them down where I can.
I know Jan has CStrAinBP() on the FEX, but at least for a single string versus an array, it's slower than my current approach. While the MEX variant is probably better, I can't practically introduce a dependency on something that a student is going to need priviliges or extra know-how to compile. Compiling MEX functions is just not really an option. That probably rules out ideal solutions.
I've seen some suggestions regarding pre-sorting, though I haven't observed any advantage in doing so. I've seen suggestions to use ismembc, though that requires enough ancillary preparation and output handling that it winds up being slower any way I try to use it. I've tried just making the expression slightly less ugly by doing something like any(strcmp(thisstring,cellarrayofstrings)) or sum(strcmp(thisstring,cellarrayofstrings))>0, though with some sets, those were very slightly slower in older versions.
To clarify, I'm only testing a single string against a cell array of 5-20 strings. I know I'm probably misguided to be chasing a few microseconds, but I'd be glad to know if someone knows a faster or less ugly way than my workaround.

2 Comments

any(strcmp(thisstring,cellarrayofstrings))
Possibly. In the case that thisstring is character vector or cell array with one character vector or scalar string object, then the any() is not needed.
I guess I shouldn't be calling them strings, should I? Sorry, I'm talking about char arrays and cell arrays of chars. If I use something like strcmp({'aaa'},{'aaa','bbb','ccc','ddd'}), I still get a logical vector that I'd need to reduce to a single logical value. Am I misunderstanding something?

Sign in to comment.

 Accepted Answer

In general my preference is "any(strcmp(s,cellchar))". I'm not sure if I did some benchmark to come to this conclusion, what I know is my codes have exclusively this syntax. Sometime I also use "ismember(s, c)" for clarity and maintainability.
Depending on what those chars mean in the program, but if they represents a finite set of something, better not to use char/string at all, but replace them with integer values (sort of enum). Lately MATLAB introduces Enumeration Class. I haven't use them yet, so no comment about speed.
Here is however an update of benchmark times of various methods for R2020a run on my Windows laptop (sec):
ismember(s,cellchar): 0.746642
~isempty(find(strcmp(s,cellchar),1)): 0.490355
any(strcmp(s,cellchar)): 0.282890
ismember(s,cellstring): 7.404710
You'll observe that the STRING class (last result, see code bellow for more detail) is remarkable slow compared to historical CHAR class.
This is a general common observation of mine: As most of the recent class MATLAB introduces (table, string, ...). They might look nicer in term of objects and method, but often they are built on top of leaner original class, thus slower. I also find their definitions/methods often have too much profileration, and users lost sense of what is beneath, whereas the historical MATLAB class is just right, closer "mathematically truth" of what those objects are. People who start learning MATLAB seem to get more confuse than ever. The small differences bewteen different classes/methods become overwhelming, and users will more likely lost their focus on what they suppose to do, meaning carry out the algorithm in an hight (more abstract) level.
For reference the benchmark code is bellow; it generates randomly a cell of 8 chars, each of them have length(8). The test loop on 9 cases where the string would be found in position 1-8 of the "patterns", plus the case where the string is not found.
function benchstrcmp
ntests = 100000;
cellchar1 = cellstr(char('a'-1+randi('z'-'a'+1,9,8)));
cellchar = cellchar1(2:end);
%%
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ismember(s,cellchar);
end
end
t=toc;
fprintf('ismember(s,cellchar): %f\n', t);
%%
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ~isempty(find(strcmp(s,cellchar),1));
end
end
t=toc;
fprintf('~isempty(find(strcmp(s,cellchar),1)): %f\n', t);
%% This is the fastest
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = any(strcmp(s,cellchar));
end
end
t=toc;
fprintf('any(strcmp(s,cellchar)): %f\n', t);
%% This is slowest by large
cellstring1 = string(cellchar1);
cellstring = string(cellchar);
tic
for k=1:length(cellchar1)
s = cellstring1(1);
for n=1:ntests
b = ismember(s,cellstring);
end
end
t=toc;
fprintf('ismember(s,cellstring): %f\n', t);

6 Comments

I went ahead and added a fourth case to reflect how I've been using it, and this is what I observe. I guess I got myself backwards when mentioning the version-specific differences, but I do see a minor speed advantage for the third case when compared to the fourth. I would think that your usage would cause an earlier exit from find(), but my guess is that the array is too short for that to be of benefit.
R2015b
ismember(s,cellchar): 6.604402
~isempty(find(strcmp(s,cellchar),1)): 1.050134
~isempty(find(strcmp(s,cellchar))): 0.901748
any(strcmp(s,cellchar)): 1.026745
R2009b
ismember(s,cellchar): 102.029833
~isempty(find(strcmp(s,cellchar),1)): 1.321144
~isempty(find(strcmp(s,cellchar))): 1.078712
any(strcmp(s,cellchar)): 1.069911
For what it's worth, I'd prefer to use any(strcmp()), simply to reduce the confusion. I may try to test this on newer versions once the other labs open back up.
Although it will not make any difference to the conclusions, using tic/toc is not the best for timing. A better approach is timeit.
function benchstrcmp
ntests = 100000;
cellchar1 = cellstr(char('a'-1+randi('z'-'a'+1,9,8)));
cellchar = cellchar1(2:end);
cellstring1 = string(cellchar1);
cellstring = string(cellchar);
%%
fh1 = @() fun1(cellchar1, cellchar, ntests);
fh2 = @() fun2(cellchar1, cellchar, ntests);
fh3 = @() fun3(cellchar1, cellchar, ntests);
fh4 = @() fun4(cellstring1, cellstring, ntests);
t1 = timeit(fh1,0);
fprintf('ismember(s,cellchar): %f\n', t1);
%%
t2 = timeit(fh2, 0);
fprintf('~isempty(find(strcmp(s,cellchar),1)): %f\n', t2);
t3 = timeit(fh3, 0);
fprintf('any(strcmp(s,cellchar)): %f\n', t3);
t4 = timeit(fh4, 0);
fprintf('ismember(s,cellstring): %f\n', t4);
end
%%
function b = fun1(cellchar1, cellchar, ntests)
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ismember(s,cellchar);
end
end
end
%%
function b = fun2(cellchar1, cellchar, ntests)
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ~isempty(find(strcmp(s,cellchar),1));
end
end
end
%% This is the fastest
function b = fun3(cellchar1, cellchar, ntests)
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = any(strcmp(s,cellchar));
end
end
end
%% This is slowest by large
function b = fun4(cellstring1, cellstring, ntests)
for k=1:length(cellstring1)
s = cellstring1(k);
for n=1:ntests
b = ismember(s,cellstring);
end
end
end
I'll have to look into timeit, though that was introduced in 2013. That kind of limits my ability to compare results between versions. I don't think anyone actually still uses my (probably extreme) legacy test case of R2009b, but I bet someone is still running an old version of R2012x somewhere. Maybe it's time I ask around and actually find out what exactly everyone is running.
timeit was a formerly a File Exchange contribution, introduced in 2008 (I think it was.) Unfortunately it no longer seems to be posted.
Probably some people still have copies of the old function around. (Jan Simon perhaps.)
I saw that, and the thought crossed my mind. I'll try using the version in R2015b and see if I want to pursue that. I suppose I could just look at timeit.m and learn from that.
Sorry but personally I'm not a big fan of TIMEIT for two 2 reasons:
  1. it tries to correct the how eventually the output parameters transfered to caller variables.
  2. It requires a wrap around function, which can be a strong bias when the "thing" that we investigate has a short runtime.
Why not just run the code natually as in REAL life and doing tic/toc to measure them?
Timeit seems to try to get rid of side times between the entanglement between the CPU to perform the main task and all secondary stuffs needed around for the main task, possibly trying to be independent also of OS activity/jitter by doing median time instead of mean time . This undoing seem to me unnatual and overkill to me.
After all tic/toc is the time user have to wait for MATLAB+OS doing some stuffs with some kind of processor. It cannot be more direct than that, and I'm personally interested in this time returned by tic/toc, nothing else matter more than that, period.
I don't want to use timeit in-place of tic/toc.

Sign in to comment.

More Answers (0)

Categories

Find more on Loops and Conditional Statements in Help Center and File Exchange

Asked:

DGM
on 19 Jul 2020

Edited:

on 20 Jul 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!