Accelerating array generation, when composed of square tiles from expensive function

6 views (last 30 days)
I have a 2*N-by-2*N matrix whose 2x2 tiles are defined by L*PSI*R, where L is a 2x3 array, PSI is a 3x3 array, and R is a 3x2 array.
It's a bit trickier, since L,R, and PSI depend on which 2x2 block is being generated. Fortunately, L and R come from the same array, and the basic structure smells parallelizable.
I've developed a few different approaches for generating this matrix, but I have an intellectual curiosity in finding better ways of doing it. I'd greatly appreciate any hints or full solutions.
In the code below, I never explicitly declare PSI. Its generation is quite intricate and distracts from the issue considerably (it would be a few dozen lines of garbage). The one thing that may help is that PSI is symmetric in its arguments, up to transpose (i.e., PSI(X(i,:),X(j,:)) == PSI(X(i,:),X(j,:))'). Feel free to use something like PSI = @ (x,y) rand(3).
Basic Setup:
matlabpool(12);
% N defines the size of the problem. Typically, N is between 1000
% and 5000, but only because this code takes so long to compute.
% I'd like to push N>10^4
N=10;
% Define a vector of points
X = rand(3,N)
% Element-wise multiplication of the columns of A against the
% vector v.
timesMatVec = @(A,b) bsxfun(@times,A,b(:));
% Change-of-basis vectors
d = @(x) timesMatVec([(-x(:,3).*x(:,1)) (-x(:,3).*x(:,2)) ...
(1-x(:,3).^2)],(1./sqrt(1-x(:,3).^2)));
e = @(x) timesMatVec([(-x(:,2)) x(:,1) ...
0*x(:,1)],(1./sqrt(1-x(:,3).^2)));
% And the corresponding arrays:
dmat = d(X(:,:));
emat = e(X(:,:));
% PSI takes two elements from X and returns a 3x3 array. Consider PSI to be an expensive
function to compute--a few hundred FLOP per call.
PSI = @(x,y) [p11 p12 p13; p21 p22 p23; p31 p32 p33];
And now, two different methods for evaluating.
First, the for-loop version to highlight what's going on:
% Loop version of evaluation:
for i = 1:size(X,1)
for j = 1:size(X,1)
Asbf( (2*i-1):(2*i),(2*j-1):(2*i) ) =
[dmat(i,:)' emat(i,:)']*...
PSI(X(j,:),X(i,:))*...
[dmat(j,:)' emat(j,:)']'
end
end
This is the arrayfun + parfor version I'm currently using. This is decently speedy, but at the expense of putting hugely redundant data in RAM. There is so much waste that my 32GB workstation is saturated when N=3249.
The basic strategy here is to compute 2-by-2*N strips at a time, since X(:,j) and R remain fixed through a strip. Parfor then offloads the strip generation to individual workers.
% Arrayfun version, with parallel toolbox
% Have to set up Asbf for use with parfor
Asbf = num2cell(Asbf, [2 3]);
% Still not quite right, so reshape.
% TODO: build this in a smarter way to avoid parfor
parfor i=1:N
Asbf{i}=reshape(Asbf{i},2,[]);
end
% Now compute. This is analogous to the for loop above
parfor i =1:size(X,1)
Asbf{i} = cell2mat(...
arrayfun(@(j) leftmat{j}*PSI(X(j,:), X(i,:))*leftmat{i}',...
(1:size(X,1))','UniformOutput', 0)...
);
end
% Asbf comes out interleaved, so it should be fixed.
% TODO: try doing this in one reshape
Asbf=cell2mat(Asbf);
Asbf_buff=reshape(Asbf,2*N,[]);
Asbf=zeros(2*N,2*N);
% Now interleave rows
Asbf(:,1:2:end)=Asbf_buff(:,1:N);
Asbf(:,2:2:end)=Asbf_buff(:,(N+1):end);
Any insight or suggestions are greatly appreciated, be they better ways of generating the array, some advice on configuring the cell arrays so parfor makes better use of memory, or otherwise. Even an incremental improvement will carry me some way.

Answers (1)

Sean de Wolski
Sean de Wolski on 11 Nov 2013
Edited: Sean de Wolski on 11 Nov 2013
I don't have time to look into this too much today or tomorrow, here are a few tips
  • Avoid using cell-arrays with parfor because their indexing will likely require the variable to not be sliced.
This is exactly what we see above with your parfor-loops.
  • This can be replaced with a matrix multiplication (as can most calls to bsxfun(@times,...
timesMatVec = @(A,b) bsxfun(@times,A,b(:));
  • Pull your d and e into separate function files. I'd expect this to be faster than anonymous functions.
  • i does not change on iterations of the j loop. So pull the calculation out of the j loop.
[dmat(i) emat(i)]*PSI(X(j,:),X(i,:))*...
To
for i
edi = [dmat(i) emat(i)];
for j = blah
Same with the 2*i
Have you used the profiler to profile the code and identify the bottlenecks?
doc profile
  1 Comment
David
David on 11 Nov 2013
I appreciate the feedback. I'll see what I can do about getting the arrays to be sliced variables. I suspect that this is the biggest culprit.
I haven't been able to figure out the proper expansion for the GEMM in e and d, hence the bsxfun. This is used to expand into the matrix I actually do want (which takes an insignificant amount of time compared to the existing bottlenecks).
The other comments are very agreeable and reflect the production code. Many thanks!

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!