
I'm calculating forward kinematics for a robotic manipulator running at 500 Hz. This means I have millions of sets of joint angles that I'm doing forward kinematics on, and I'm trying to do the math as quickly as possible. I've been able to do everything quickly outside of a loop except for the final matrix multiplication. My loop takes multiple Nx4 arrays, breaks them down into 4x4 matrices, and multiplies those together to get from the base frame to the final end-effector frame. Here's a sample of the data (I can't share as much info as I would hope); it's 3 arrays, each containing 3 stacked 4x4 matrices.

Tc0 = [1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1; 1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1; 1,0,0,0; 0,1,0,0; 0,0,1,0; 0,0,0,1];

T01 = [0.74,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.999,0.35; 0,0,0,1; 0.746,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.99,0.35; 0,0,0,1; 0.746,-0.66,0,0.0012; 0.66,0.74,0.002,0.0007; -0.0013,-0.0015,0.99,0.35; 0,0,0,1];

T12 = [0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1; 0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1; 0.82,0.57,0,0.003; 0.0009,-0.0012,-0.999,-0.25; -0.57,0.82,-0.0015,-0.0004; 0,0,0,1];

q = size(T12,1);
Tc3 = zeros(q,4);
for i = 1:4:q
    Tc3(i:i+3,1:4) = (Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4);
end

So that takes about 2 seconds for every 100,000 iterations, which isn't bad, but over millions of data points it really adds up. I'd like to avoid the loop, but I can't just multiply the entire arrays together (not that I know of, since elementwise multiplication isn't the right operation).

Any ideas?

Thanks

EDIT: I'm wondering if maybe there would be a way to do this using 3D arrays? I.e., Tc0 = reshape(Tc0,4,4,3); T01 = reshape(T01,4,4,3); T12 = reshape(T12,4,4,3), and then somehow multiply the matrices pagewise with respect to their 3rd-dimension index?
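For later readers: this pagewise multiply now exists as a built-in. A minimal sketch, assuming MATLAB R2020b or newer for pagemtimes; note that a plain reshape won't preserve the 4x4 blocks, because MATLAB stores arrays column-major, so a transpose and permute are needed first:

```matlab
% Convert the vertically stacked 4Nx4 arrays into 4x4xN pages
Tc0t = permute(reshape(Tc0.',4,4,[]),[2 1 3]);  % undo the row-wise stacking
T01t = permute(reshape(T01.',4,4,[]),[2 1 3]);
T12t = permute(reshape(T12.',4,4,[]),[2 1 3]);

% pagemtimes multiplies corresponding 4x4 pages along the 3rd dimension
Tc3t = pagemtimes(pagemtimes(Tc0t,T01t),T12t);
```

The answers below cover the same idea for releases that predate pagemtimes.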

James Tursa on 16 Apr 2019 (edited 16 Apr 2019)

First, some basic data storage advice. When you have multiple matrices stored in a larger matrix or array, it is usually best to have the individual matrices stored contiguously in memory. This does two important things for you:

(1) Allows the cache to work better for you because you don't have to jump around in memory to piece together the elements to form the smaller matrices you want to work with.

(2) Allows you to use several of the FEX submissions that do nD matrix multiply on the 2D pages of your smaller matrices.

Unfortunately, you don't have this situation with your data. Because you have stacked your individual 4x4 matrices vertically in your larger matrix, the individual 4x4 matrix elements are scattered all over memory and are not contiguous. This defeats the caching that would otherwise have helped you and slows down your 4x4 matrix access times. The sub-matrix formation also creates a lot of temporary matrices that need data copying. Most of the time is spent doing this extraneous work rather than doing the actual multiply calculations.

I would suggest that you reorder your data to make the 4x4 matrices contiguous in memory. That will make the downstream programming much easier and allow you to use the very fast FEX submissions for the actual multiplies.

E.g., here is a comparison of your looping M-code with 4x4 elements scattered in memory vs the same calculations done in compiled mex C-code using OpenMP parallel processing:

% Some large sample data
n = 1000000;
Tc0 = rand(4*n,4);
T01 = rand(4*n,4);
T12 = rand(4*n,4);

% The loop method with 4x4 elements scattered in memory
disp(' ');
disp('The M-code loops method:')
tic
q = size(T12,1);
Tc3 = zeros(q,4);
for i = 1:4:q
    Tc3(i:i+3,1:4) = (Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4);
end
toc

% Suppose we had the 4x4 elements contiguous in memory to start with in 3D arrays
Tc0t = reshape(Tc0.',4,4,[]); Tc0t = permute(Tc0t,[2 1 3]);
T01t = reshape(T01.',4,4,[]); T01t = permute(T01t,[2 1 3]);
T12t = reshape(T12.',4,4,[]); T12t = permute(T12t,[2 1 3]);

disp(' ');
disp('The MTIMESX method:');
fprintf('The compiler used: %s\n',mtimesx('compiler'))                      % Load mtimesx and print out the compiler used
fprintf('The calculation mode: %s\n',mtimesx('speedomp'))                   % Put mtimesx into fastest mode
fprintf('Number of threads available: %d\n',mtimesx('omp_get_max_threads')) % Print number of threads available

% Do the nD matrix multiply on the 2D pages with compiled parallel mex C-code
tic
Tc3x = mtimesx(mtimesx(Tc0t,T01t),T12t);
toc

% Compare the results
disp(' ');
disp('Are the results equal?');
Tc3m = permute(reshape(Tc3.',4,4,[]),[2 1 3]);
isequal(Tc3m,Tc3x)

And a sample run:

The M-code loops method:
Elapsed time is 4.797704 seconds.

The MTIMESX method:
The compiler used: Microsoft_Visual_C++_2013_Professional_(C)_12.0_cl
The calculation mode: SPEEDOMP
Number of threads available: 16
Elapsed time is 0.151637 seconds.

Are the results equal?
ans =
  logical
   1

So, 4.797704 sec vs 0.151637 sec. That means 97% of your M-code method is spent just doing data copying, data access, temporary variable creation, etc., and only 3% is spent doing the actual multiplies. Not a good ratio for the M-code! In the above example, MTIMESX actually used unrolled parallel loops to do the individual 4x4 matrix multiplies (it will do this for sizes up to 5x5). For larger 2D matrix sizes, it will call into the BLAS library that MATLAB uses to do the matrix multiplies. But in all cases it does not do any data copying or temporary variable creation to do the work ... it points into the variables directly to get at the 4x4 matrices.

The above example uses an FEX submission called MTIMESX which requires a supported C compiler to be installed on your machine. The author hasn't updated it for recent versions of MATLAB (he really needs to find time to do this!), so you may have an adventure getting it to compile on your machine. This, and other FEX submissions that you could use for this calculation (assuming you reorder your data as suggested) can be found here:

MTIMESX: (mex, needs C-compiler)

MMX: (mex, needs C-compiler)

MULTIPROD: (M-code)
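Assuming the data has been reordered into contiguous 4x4xN pages as suggested above, the call shapes for these submissions look roughly as follows. These are sketches of each submission's interface from memory, not tested here; double-check the argument conventions against the version you download:

```matlab
% All three compute the pagewise product Tc0t*T01t*T12t on 4x4xN arrays
Tc3x = mtimesx(mtimesx(Tc0t,T01t),T12t);          % MTIMESX
Tc3x = mmx('mult', mmx('mult',Tc0t,T01t), T12t);  % MMX
Tc3x = multiprod(multiprod(Tc0t,T01t), T12t);     % MULTIPROD
```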

Jon
on 16 Apr 2019

Very enlightening discussion - thank you.

Interestingly, with the final approach I outlined in the comments to my earlier answer, in which I utilized a sparse block diagonal matrix, the time for 1e6 individual matrices (as in your example) is 0.8 seconds compared to 4 for the original. I would infer from your discussion that the sparse block diagonal matrices are also arranged more efficiently in memory. So not quite as good as the dedicated library, but if that is hard to get working due to version issues, it might be something to look at.


Jon on 16 Apr 2019 (edited 16 Apr 2019)

I don't know if it would be any more efficient than your loop (you could check with some timing tests), but here is an alternative approach without loops. The idea is to take each of your three 4x4 matrices, currently stacked in a 12x4 matrix, and put them into a block diagonal matrix with each 4x4 matrix on the main diagonal. Then you can directly multiply these block diagonal matrices to get your result. If you need the result back in stacked form, you can recover it through indexing at the end. I'm not sure from your description whether you always have just three stacked matrices or whether that was just for the example. If you typically have many more than three, this approach will not scale well, as you would get huge 4*N x 4*N block diagonal matrices, where N is the number of stacked matrices.

Anyhow in case it is helpful here is some code that illustrates the approach

% put stacked matrices into block diagonal matrices and multiply
Tc3diag = blkdiag(Tc0(1:4,:),Tc0(5:8,:),Tc0(9:12,:))*...
          blkdiag(T01(1:4,:),T01(5:8,:),T01(9:12,:))*...
          blkdiag(T12(1:4,:),T12(5:8,:),T12(9:12,:));
% collapse the output block diagonal matrix into a stacked matrix
Tc3 = [Tc3diag(1:4,1:4); Tc3diag(5:8,5:8); Tc3diag(9:12,9:12)];

Jon
on 16 Apr 2019

I was also thinking along the lines of storing the values in a 3D array rather than a stacked 2D array. It still seems that you would need a loop, but I made the following script to compare the timing of the two approaches:

% timing comparison test on block matrix multiplication for large matrices
N = 1e6; % number of individual matrices

% looping through stacked matrices (current approach)
Tc0 = rand(4*N,4);
T01 = rand(4*N,4);
T12 = rand(4*N,4);
q = size(T12,1);
Tc3 = zeros(q,4);
disp(' ')
disp('elapsed time for current approach')
tic
for i = 1:4:q
    Tc3(i:i+3,1:4) = (Tc0(i:i+3,1:4)*T01(i:i+3,1:4))*T12(i:i+3,1:4);
end
toc

% looping through 3d array (alternate approach)
Tc0 = rand(4,4,N);
T01 = rand(4,4,N);
T12 = rand(4,4,N);
Tc3 = zeros(4,4,N);
disp(' ')
disp('elapsed time for alternate approach')
tic
for i = 1:N
    Tc3(:,:,i) = Tc0(:,:,i)*T01(:,:,i)*T12(:,:,i); % store each page of the result
end
toc

I found that the 3D approach was in fact faster; typically it took approximately 60% as long as the current approach. So not a huge savings, but something. To use this, though, I think you would want to store the data in 3D arrays from the start. Otherwise, you'd probably use up some of your savings on the reshaping.

By the way, if you need to do the reshaping, you can do it using

Tc0_3d = permute(reshape(Tc0',4,4,N),[2,1,3]);

Note that in the above you first transpose the matrix and then permute the array dimensions, to deal with the fact that reshape acts columnwise.
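A tiny example of why both steps are needed, using hypothetical 2x2 blocks instead of 4x4: MATLAB is column-major, so reshape alone reads down the columns and scrambles the blocks.

```matlab
A = [1 2; 3 4; 5 6; 7 8];                     % two 2x2 matrices stacked vertically
bad  = reshape(A,2,2,2);                      % bad(:,:,1) is [1 5; 3 7] -- blocks scrambled
good = permute(reshape(A.',2,2,[]),[2 1 3]);  % good(:,:,1) is [1 2; 3 4]
                                              % good(:,:,2) is [5 6; 7 8]
```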

Jon
on 16 Apr 2019

Actually, going back to the original idea of keeping the data in block diagonal matrices, this may still be a viable approach if you use sparse block diagonal matrices. In that case I think the memory requirements would be similar to what you are doing already. In the attached code I did a timing test for this approach with N = 1e6 individual blocks. I found that the matrix multiply, which is now just one line of code, takes only 0.8 seconds, compared to 2.5 to 4 seconds for the other approaches. That seems to be a substantial speedup. I don't know whether you have the freedom to rewrite the underlying code so that you can just stay with sparse block diagonal matrices; otherwise, going back and forth between stacked and sparse block diagonal would probably eat up any time savings.

%% sparse block matrix multiplication
N = 1e6;

% make sparse block diagonal matrices
Tc0blocks = cell(N,1);
T01blocks = cell(N,1);
T12blocks = cell(N,1);

% loop to assign individual block elements
for k = 1:N
    Tc0blocks{k} = sparse(rand(4,4));
    T01blocks{k} = sparse(rand(4,4));
    T12blocks{k} = sparse(rand(4,4));
end

% combine into block diagonal matrices
% They will be sparse because the elements are sparse
Tc0 = blkdiag(Tc0blocks{:});
T01 = blkdiag(T01blocks{:});
T12 = blkdiag(T12blocks{:});

disp(' ')
disp('elapsed time for sparse block diagonal matrix approach')
tic
Tc3 = Tc0*T01*T12;
toc

Note that in my timing I only count the part where the multiplication is done. The rest is just setting up the test.
