Fastest way to dot product all columns of two matrices of the same size
I have come up against this problem over and over, and I have a working solution, but it seems non-optimal speed-wise. Does anybody have a better way?
Problem: I will typically have two matrices of the same size (A and B) with a small number of rows (m ~ 5-10) and a large number of columns (n ~ 1000) (think of each row as a "state" and each column as the state values at a point in time). Is there an efficient way to dot product each pair of corresponding columns, ending up with a row vector of length n?
My current solution is:
C = sum(A.*B, 1);
The reason I don't like this is that for single column vectors
sum(v1.*v2, 1)
is about twice as slow as a regular dot product, so I suspect there is a better way to do this for the matrix problem (I care because it is typically executed hundreds of thousands of times in my code).
By the way, looping over the columns and taking dot products one at a time is really slow, so don't bother trying anything with a for loop...
Answers (2)
Jan on 20 Feb 2012 (Edited: 11 Mar 2014)
A direct C-MEX function can avoid creating the large temporary array A.*B by accumulating the sum on the fly:
// Compile:
// DOT as C-loop: mex -O SumAB1.c libmwblas.lib -DUSE_C_LOOP
// DOT as BLAS call: mex -O SumAB1.c libmwblas.lib
//
// Tested: Matlab 7.8, 7.13, Win7/64, Compiler: MSVC2008
// Author: Jan Simon, Heidelberg, (C) 2012 matlab.2010ATnMINUSsimonDOTde
// License: BSD
#include "mex.h"

#ifndef USE_C_LOOP
double ddot(mwSignedIndex *k, double *a, mwSignedIndex *ai,
            double *b, mwSignedIndex *bi);
void CoreBlas(double *a, double *b, double *c,
              mwSignedIndex M, mwSignedIndex N);
#else
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N);
#endif

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  mwSize M, N;
  const mxArray *A, *B;

  // Proper number of arguments:
  if (nrhs != 2) {
     mexErrMsgTxt("*** SumAB1[mex]: 2 inputs required.");
  }
  if (nlhs > 1) {
     mexErrMsgTxt("*** SumAB1[mex]: 1 output allowed.");
  }

  // Pointers to inputs:
  A = prhs[0];
  B = prhs[1];
  M = mxGetM(A);
  N = mxGetN(A);

  // Inputs must be double matrices of the same size:
  if (!mxIsDouble(A) || mxGetNumberOfDimensions(A) != 2 ||
      !mxIsDouble(B) || mxGetNumberOfDimensions(B) != 2 ||
      mxGetM(B) != M || mxGetN(B) != N) {
     mexErrMsgTxt("*** SumAB1[mex]: "
                  "Inputs must be real double matrices of the same size.");
  }

  // Create output:
  plhs[0] = mxCreateDoubleMatrix(1, N, mxREAL);

#ifndef USE_C_LOOP
  CoreBlas(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#else
  CoreLoop(mxGetPr(A), mxGetPr(B), mxGetPr(plhs[0]), M, N);
#endif

  return;
}

// *****************************************************************************
void CoreLoop(double *a, double *b, double *c, mwSize M, mwSize N)
{
  // DOT product as C loop:
  mwSize i, j;
  double s;

  for (i = 0; i < N; i++) {
     s = 0.0;
     for (j = 0; j < M; j++) {
        s += *a++ * *b++;
     }
     c[i] = s;
  }
  return;
}

// *****************************************************************************
void CoreBlas(double *a, double *b, double *c, mwSignedIndex M, mwSignedIndex N)
{
  // DOT product as BLAS call, faster if M > 100:
  mwSignedIndex i, one = 1;

  for (i = 0; i < N; i++) {
     c[i] = ddot(&M, a + i * M, &one, b + i * M, &one);
  }
  return;
}
With MSVC 2008 this runs about twice as fast as sum(A.*B, 1) in Matlab 2009a/64/Win7.
[EDITED] Inserted a compiler directive to perform the DOT using the faster BLAS DDOT call.
[EDITED 2] Sorry, DDOT is faster only for about M>100, not for M=10!
A hard-coded loop is 16% faster than the generic C-loop method above for [10 x 1000] matrices:
void CoreLoop_10(double *a, double *b, double *c, mwSize N)
{
  // DOT product hard-coded for !!! M == 10 !!!:
  mwSize i;

  for (i = 0; i < N; i++) {
     c[i] = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
          + a[4] * b[4] + a[5] * b[5] + a[6] * b[6] + a[7] * b[7]
          + a[8] * b[8] + a[9] * b[9];
     a += 10;   // advance to the next column
     b += 10;
  }
  return;
}
But then you have to branch in the main function for the different values of M. An SSE3 approach would be faster still, but then you have to take care of 128-bit alignment.
6 Comments
James Tursa on 21 Feb 2012
@Jan: Well, in my 32-bit WinXP R2011a, DDOT is 25% slower than the hand-coded loops you have above (timing for 10x1000000). Not sure which 3rd-party BLAS library R2011a uses, but as I recall it is not MKL (this info used to be listed in the doc). But on your machine the BLAS is faster. This is a poster-child case for why I have difficulty figuring out at runtime which method to use in MTIMESX.
Regarding MTIMESX: it would likely not help in this case, since it just adds some overhead and performs the same basic loops you have above in the background. Also, there would be some reshapes involved to coax MTIMESX into doing this calculation, which again adds a little overhead (maybe I can add this particular calculation as a direct option in a future version of MTIMESX to avoid the reshapes). One exception would be if the dot product size were very large (>>10), in which case the unrolled loops and OpenMP in MTIMESX perform a little faster than the straight loops you have above.
Jan on 21 Feb 2012
Doh. I read the corresponding sentence in the question three times, because I have confused this too often in the forum already. Nevertheless, I measured the timings with 1000x10 matrices instead of 10x1000.
With that shape, the BLAS DDOT method takes 30% more time than sum(A.*B)!
New timings:
A = rand(10, 1000); B = rand(10, 1000);
tic; for i=1:10000; C2 = MyMexFunc(A, B); end; toc
Hard coded for M=10: 0.162 sec
The C-loop "s += *a++ * *b++": 0.196 sec
BLAS:DDOT: 0.444 sec
sum(A.*B): 0.344 sec
Jiro Doke on 20 Feb 2012
When you say you're doing the dot product, I assume you're doing this:
v1'*v2
and not
dot(v1, v2)
The equivalent for your scenario, where you have matrices, would be something like this:
diag(A'*B)'
But that computes the full n-by-n product A'*B only to keep its diagonal, and adds further overhead with the extra calls to diag and transpose.
Have you tried calling sum without the dimension parameter?
C = sum(A.*B);
This is slightly faster than
C = sum(A.*B, 1);