Speeding up matrix exponentials
Hey all,
I am trying to speed up a calculation involving matrix exponentials, but it is not fast enough and I cannot think of a more efficient approach (without using parfor). The code is:
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = [];
counter = 1; %keep track of progress of for loop%
for i = 1:10^4-1
    factor = exp(1j*(i+1)) - exp(1j*i);
    c = [c,b^factor];
    counter
    counter = counter+1;
end
This process takes too long for 10^4 iterations so I was hoping there was a faster method, maybe some kind of vectorization or something?
Thanks
7 Comments
Mike Croucher on 4 Apr 2023 (edited 4 Apr 2023)
Can you talk about your application any more? Some questions I have include
- Would single precision be good enough for you?
- There is no pagempower function. But if there were, would it have helped you? If so, a bit more info about your application would be really useful.
- Why can't you use parfor?
- What hardware do you have access to? HPC cluster? GPU?
Accepted Answer
Mike Croucher on 4 Apr 2023 (edited 10 Apr 2023)
Don't let the fact that I'm MathWorks staff fool you here -- I am playing outside of my comfort zone and this suggestion may be a Bad Idea! Use at your own risk.
With that said, let's look at a smaller number of iterations for the purposes of timing:
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = cell(100,1);
tic
for k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = b^fact;
end
toc
This took 58.9 seconds on my machine (MATLAB R2023a).
A bit of googling indicated that the matrix power A^z, where z is a complex scalar, can be written in terms of the matrix exponential and matrix logarithm. That is:
A^z == expm(z*logm(A))
I guessed two things:
- That maybe this is faster than the built-in mpower for this case.
- That there aren't any numerical gotchas I don't know about!
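Before applying this to the problem at hand, here is a quick sanity check of the identity on a small test matrix (my own addition, not part of the original experiments; the matrix and exponent are arbitrary):
% Quick sanity check of A^z == expm(z*logm(A)) on a small test matrix.
% The shift by 5*eye(5) keeps the eigenvalues away from the negative real
% axis, where the principal matrix logarithm is problematic.
rng(0); % reproducible
A = rand(5) + 5*eye(5);
z = 0.3 + 0.7i;
err = max(abs(A^z - expm(z*logm(A))), [], 'all');
fprintf("Maximum difference between A^z and expm(z*logm(A)) is %e\n", err)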
Now, you want to compute b^fact where b = expm(a), so logm(b) is just a in this case. That means we can write
b^fact == expm(fact*a)
Let's give it a try
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = cell(100,1);
tic
for k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = expm(fact*a);
end
toc
Runs in 26.3 seconds, so it's definitely faster. But is it correct? Let's compute both versions side by side and look for the maximum absolute difference in any element.
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = cell(100,1);
%accuracy check
maxdiff = 0;
for k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    fastermethod = expm(fact*a);
    slowmethod = b^fact;
    thisdiff = max(abs(fastermethod-slowmethod),[],'all'); %thisdiff avoids shadowing the built-in diff
    if thisdiff > maxdiff
        maxdiff = thisdiff;
    end
end
fprintf("Maximum absolute difference in any matrix element is %e\n",maxdiff);
Maximum absolute difference in any matrix element is 2.608528e-14
Seems we may be OK.
I'm off to ask the linear algebra team if this is OK! (Update): Yeah, it's OK.
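For what it's worth, here is my own reading of why it is OK (this reasoning is mine, not the linear algebra team's): the substitution relies on logm(expm(a)) == a, which is guaranteed when every eigenvalue of a has imaginary part strictly inside (-pi, pi). This particular a is skew-symmetric, so its eigenvalues are purely imaginary, and a quick check confirms they are comfortably inside that strip:
% Check that the eigenvalues of a lie in the strip where
% logm(expm(a)) == a holds, i.e. |imag(eig(a))| < pi.
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
ev = eig(a);
fprintf("Largest |imag(eig(a))| is %.4f (needs to be < pi = %.4f)\n", ...
    max(abs(imag(ev))), pi)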
Update following your answers to my questions in the comments
Single precision
You said that single precision was OK. Matrix operations in single precision are often faster than in double precision, and that is the case here. All we need to do is turn a into a single-precision matrix:
a = single(diag(ones(1,999),1) + diag(-ones(1,999),-1));
b = expm(a);
c = cell(100,1);
tic
for k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = expm(fact*a);
end
toc
Elapsed time is 13.950966 seconds.
I strongly suggest checking your results. I'm always nervous when going to single precision.
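If you want a concrete check, one option (a sketch along the lines of the accuracy test above; my addition) is to run a handful of iterations in both precisions and compare:
% Compare a few iterations in single and double precision.
a_double = diag(ones(1,999),1) + diag(-ones(1,999),-1);
a_single = single(a_double);
maxdiff = 0;
for k = 1:10
    fact = exp(1j*(k+1)) - exp(1j*k);
    d = max(abs(expm(fact*a_single) - expm(fact*a_double)),[],'all');
    maxdiff = max(maxdiff,d);
end
fprintf("Maximum single-vs-double difference over 10 iterations is %e\n", maxdiff)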
Making parfor faster
You said that parfor made things slower. Perhaps this is because you are using the 'traditional' process pool, which runs each worker as its own process. It's the default out of the box, but to be explicit I mean this:
%Double precision and parallel process pool
parpool("processes")
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = cell(100,1);
tic
parfor k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = expm(fact*a);
end
toc
Elapsed time is 28.948511 seconds.
Indeed, the slowest so far. I suggest you try the newer thread pool. First, make sure you delete any existing parallel pool:
delete(gcp('nocreate'))
Then let's try our code again using a thread pool:
parpool("threads")
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
b = expm(a);
c = cell(100,1);
tic
parfor k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = expm(fact*a);
end
toc
Elapsed time is 20.230263 seconds.
This is the double precision version of the code, which took 26.3 seconds originally, so it's faster. Not by much, but enough to make it worthwhile, on my machine at least. I have an 8-core system.
You may be disappointed with the performance here, but functions like expm are multithreaded anyway. Each call to expm was using multiple cores and so was 'implicitly multithreaded'. When I use a parfor on my machine, I make 8 expm calls simultaneously, but each one takes longer because each expm is only using 1 core.
The practical upshot is that for this code we are not going from serial to parallel when moving from for to parfor. We are going from one type of parallelism to another, which just happens to be a little more efficient on my machine. It might be different for you, but it's worth a try.
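If you want to see the implicit multithreading for yourself, one way (a sketch using maxNumCompThreads; my addition, not part of the original answer) is to time a single expm call with the default number of computational threads and then with just one:
% Time one expm call with all threads, then with a single thread.
a = diag(ones(1,999),1) + diag(-ones(1,999),-1);
nThreads = maxNumCompThreads;    % current computational thread count
tic; expm(1j*a); tMulti = toc;
maxNumCompThreads(1);            % restrict MATLAB to one computational thread
tic; expm(1j*a); tSingle = toc;
maxNumCompThreads(nThreads);     % restore the original setting
fprintf("expm took %.2fs with %d threads and %.2fs with 1 thread\n", ...
    tMulti, nThreads, tSingle)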
We can combine single precision and parfor, which on my machine gives us:
delete(gcp('nocreate')) %% Delete a pool if it already exists
parpool("threads")
a = single(diag(ones(1,999),1) + diag(-ones(1,999),-1));
b = expm(a);
c = cell(100,1);
tic
parfor k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = expm(fact*a);
end
toc
Elapsed time is 9.022758 seconds.
Using a GPU
I have an NVIDIA GeForce RTX 3070, which is pretty weak in double precision but very powerful in single. I'll skip the double precision version (it's slower on my machine, as I expected) and go straight to single:
% Set up GPU
dev = gpuDevice();
% Create a on the GPU
a = gpuArray(single(diag(ones(1,999),1) + diag(-ones(1,999),-1)));
b = expm(a);
c = cell(100,1);
tic
for k = 1:10^2-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = gather(expm(fact*a));
end
wait(dev) %Wait for all GPU computations to complete before calling toc
toc
Elapsed time is 1.968139 seconds.
This last solution scales nicely. Up until now we've only been doing 99 iterations to keep the timings manageable, but I can do all 10^4-1 = 9999 iterations you originally asked for, no problem:
% Do the full 9999 iterations on GPU
% Set up GPU
dev = gpuDevice();
% Create a on the GPU
a = gpuArray(single(diag(ones(1,999),1) + diag(-ones(1,999),-1)));
b = expm(a);
c = cell(10000,1);
tic
for k = 1:10^4-1
    fact = exp(1j*(k+1)) - exp(1j*k);
    c{k} = gather(expm(fact*a));
end
wait(dev) %Wait for all GPU computations to complete before calling toc
toc
Elapsed time is 224.915441 seconds.
Summary on my machine
For 10^2-1 = 99 iterations
- 58.9s (Original code)
- 28.9s (Algorithm change, double precision, parfor loop with process pool)
- 26.3s (Algorithm change, double precision, for-loop)
- 20.2s (Algorithm change, double precision, parfor loop with threadpool)
- 13.9s (Algorithm change, single precision, for-loop)
- 9s (Algorithm change, single precision, parfor loop with threadpool)
- 1.96s (Algorithm change, single precision, GPU)