Speed performance: Find all y-vector entries that have the same value in an x-vector of equal length.

1 view (last 30 days)

Hi Guys,

imagine you have an x-vector

   x = [1 1 2 2 1 1];

as well as an y-vector of equal length.

    y = [1 2 3 4 5 6]

And you want to find out, which values the y-vector has at those indices where the x-vector has the value 1, as well as the y-vector values at those indices where the x-vector has the value 2. So that the solution for this example would be

solution = {[1 2 5 6], [3 4]} 

That is the first cell of 'solution' contains all y-values, that belong to the first unique x-value (1) & the second cell of it contains all y-values, that belong to the second unique x-value (2).

My ideas on how to implement this for vectors of any length are given below.

Does anybody has an idea, how to avoid the for-loop in approach A???

x = randi(10, 1, 100000);
y = randi(5, 1, 100000);
% Approach A:
tic 
xVals = unique(x); % get unique x values
tf = x == xVals';  % get true-or-false logical array (each row specifies where the n-th value of xVals occurs in x).
yVals = cell(1, size(tf,1)); % preallocate yVals
for i = 1:size(tf,1)
    yVals(i) = {y(tf(i,:))}; % Assign all y-values that belong to one xVal-entry to one cell.
end
toc
% Approach B: 
tic
[xs, id] = sort(x); % xsorted
ys = y(id); % sort y the same way x was sorted
[xVals, ia] = unique(xs);
% Divide the ys-vector into sections whose corresponding xs-indices have the same value in the xs vector.
yVals = mat2cell(ys, [1], [diff(ia)', length(ys)-sum(diff(ia))]); 
toc
% Runtimes:
%  x = randi(10, 1, 1000000);
%  y = randi(5, 1, 1000000);
%  A: Elapsed time is 0.100635 seconds.
%  B: Elapsed time is 0.149284 seconds. 
%  x = randi(100, 1, 1000000);
%  y = randi(5, 1, 1000000);
%  A: Elapsed time is 0.886396 seconds.
%  B: Elapsed time is 0.139918 seconds.
% Comparison
% A is faster than B if there are few different x-values (about 10-20). 
% The number of different x-values has high impact on speed of A.
% A is faster than B if x & y-vector are extremely long (> 100000).
% The vector length has low impact on speed of B.
  7 Comments
Star Strider
Star Strider on 20 Apr 2018

My pleasure!

I was surprised that accumarray was slower than the loop, even though it makes for neater code.

I didn’t test ‘Approach B’, since ‘Approach A’ (chosen randomly) was significantly faster than my accumarray call.

Zwithouta
Zwithouta on 20 Apr 2018
Edited: Zwithouta on 20 Apr 2018

Enjoy it with caution! Approach A really suffers terribly from increasing the number of unique-values in the x-vector, since it increases the number of needed for-loop iterations. Approach B in contrast shows a really stable performance.

y = randi(5, 1, 1000000);
x = randi(10, 1, 1000000);
Elapsed time is 0.108344 seconds. % A
Elapsed time is 0.195844 seconds. % B
x = randi(100, 1, 1000000);
Elapsed time is 0.877760 seconds. % A
Elapsed time is 0.182726 seconds. % B
x = randi(1000, 1, 1000000);
Elapsed time is 11.379839 seconds.% A
Elapsed time is 0.196772 seconds. % B

Sign in to comment.

Answers (1)

Paul Shoemaker
Paul Shoemaker on 20 Apr 2018

If you're trying to avoid a FOR loop you can use arrayfun, although I'm not sure it improves speed.

It would look something like this:

x = [1 1 2 2 1 1];
y = [1 2 3 4 5 6];
solution = arrayfun(@(z) y(z==x),unique(x),'uniformoutput',false);

Paul Shoemaker

http://www.matlabinvesting.com

  1 Comment
Zwithouta
Zwithouta on 20 Apr 2018
Thanks for sharing your thoughts on that topic!
Your approach is the fastest for long x-vectors so far, but suffers similar to approach A from high numbers of unique x-values.

Sign in to comment.

Categories

Find more on Graphics Objects in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!