Scaling of vectorized code which is limited by memory access

Question

ConvexHull on 9 Aug 2019

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/475614-scaling-of-vectorized-code-which-is-limited-by-memory-access

Commented: ConvexHull on 12 Aug 2019

Dear all,

what possibilities are there to make a vectorized code faster, where the speed is limited by memory access?

I have attached two runs on two different machines with the same vectorized MATLAB code. Here var_z_analyse is a huge array of fixed size, which will be accessed/querried by a long list of indices which may vary in size and content for each call. May parfor be a solution in this context?

For example the first line does not benefit from 4 to 18 cores, whereas the last two lines scale very well.

first machine has 4 cores on one node

seconde machine has 18 cores on one node

Regards

11 Comments
Show 9 older commentsHide 9 older comments

ConvexHull on 10 Aug 2019

Edited: ConvexHull on 10 Aug 2019

Some background:

The code is part of an quadtree look up table approach in two (three) dimensions, where i have to find the right quad containing the data in a monomial representation. The query is not based on a tree traverse, but rather based on a hash table approach.

the ids of the quads are found e.g. in line 1
to evaluate the data from the modal representation (here monomials), i have to build-up different vandermonde matrices with horner sheme, e.g. line 2-7
i calculate the two dimentional interpolant via the outerproduct (which is nothing else than Newton interpolation, but highly efficient), e.g. line 8-9

@ James Tursa:

The size of the arrays are, e.g.,

size(a_I)=[1,M] ,

size(A)=[3,3,M] ,

size(var_z_analyse)=[3,3,N] ,

size(Vandermondematrix_x)=[3,1,M] ,

size(Vandermondematrix_y)=[1,3,M] ,

size(outer_product)=[3,3,M] ,

size(u_z)=[1,M] ,

where N is fixed huge, e.g. 6000000, and M is arbitrary huge, e.g. 1000000, during runtime. Also note that 3 is the polynomial degree+1.

Does a mex-file implementation also suffer of M beeing arbitrary and what about the scaling with mex files?

@ Walter Roberson:

Thank you, i know that these operations are similary, but i dont think it will help me with this issue.

@ Bruno Luong:

As you already said, the bottleneck is not the outer product operation. The main problem is the data query.

Thank you for the discussion!

Regards

ConvexHull on 10 Aug 2019

Edited: ConvexHull on 10 Aug 2019

Open in MATLAB Online

I have implemented some stuff, yet without success!

Any suggestions?

tic
for i=1:1000
    A_mex=reshape(hash(reshape(var_z_analyse,[9,size(var_z_analyse,3)]),int64(a_I(:))),[3,3,size(a_I(:),1)]);
end
toc

4 cores: Elapsed time is 3.255311 seconds.

16 cores: Elapsed time is 3.634145 seconds.

tic
for i=1:1000
    A=var_z_analyse(:,:,a_I(:));
end
toc

4 cores: Elapsed time is 3.054769 seconds.

16 cores: Elapsed time is 3.395311 seconds.

The size of the arrays:

size(a_I)=[1,M] ,

size(A)=[3,3,M] ,

size(var_z_analyse)=[3,3,N] ,

with N=19012 and M=130000.

I have compiled the mex file hash.f90 with gfortran

mex CFLAGS='$CFLAGS -fopenmp' LDFLAGS='$LDFLAGS -fopenmp' COPTIMFLAGS='$COPTIMFLAGS -fopenmp -O2' LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -O2' DEFINES='$DEFINES -fopenmp' -v hash.F90

on linux ubuntu machine 18.04 with openmp.

#include "fintrf.h"
!==================================================================================================================================
!> Wrapper between MATLAB and Fortan90
!==================================================================================================================================
SUBROUTINE mexFunction(nlhs,plhs,nrhs,prhs)
! MODULES
! IMPLICIT VARIABLE HANDLING
IMPLICIT NONE
!----------------------------------------------------------------------------------------------------------------------------------
! INPUT / OUTPUT VARIABLES
!----------------------------------------------------------------------------------------------------------------------------------
! LOCAL VARIABLES 
mwPointer         :: mxCreateDoubleMatrix,mxGetPr
mwPointer         :: plhs(*),prhs(*)
mwPointer         :: mxGetM,mxGetN
mwPointer         :: x_pr1,y_pr1,x_pr2
mwPointer         :: m1,n1,m2,n2
mwSize            :: size1,size2,size3
INTEGER           :: nlhs,nrhs,mxIsNumeric
INTEGER,PARAMETER :: numel = 2000000
DOUBLE PRECISION  :: x1(numel),y1(numel)
INTEGER           :: x2(numel)
!==================================================================================================================================
! Check for proper number of arguments. 
IF(nrhs.NE.2)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:nInput','Two input required.')
ELSEIF(nlhs.NE.1)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:nOutput','One output required.')
END IF
! Get the size of the input arrays.
m1    = mxGetM(prhs(1))
n1    = mxGetN(prhs(1))
size1 = m1*n1
m2    = mxGetM(prhs(2))
n2    = mxGetN(prhs(2))
size2 = m2*n2
! output size
size3 = m1*n2
! Column * row must be smaller than numel
IF(size1.GT.numel)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:mSize','Row * column must be <= 1000.')
END IF
! Column * row must be smaller than numel
IF(size2.GT.numel)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:mSize','Row * column must be <= 1000.')
END IF
 
! Check that the array is numeric (not strings).
IF(mxIsNumeric(prhs(1)).EQ.0)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:NonNumeric','Input must be a numeric array.')
END IF
! Check that the array is numeric (not strings).
IF(mxIsNumeric(prhs(2)).EQ.0)THEN
   CALL mexErrMsgIdAndTxt('MATLAB:hash:NonNumeric','Input must be a numeric array.')
END IF
! Create matrix for the return argument.
plhs(1) = mxCreateDoubleMatrix(m1,n2,0)
x_pr1   = mxGetPr(prhs(1))
x_pr2   = mxGetPr(prhs(2))
y_pr1   = mxGetPr(plhs(1))
CALL mxCopyPtrToReal8(x_pr1,x1,size1)
CALL mxCopyPtrToInteger8(x_pr2,x2,size2)
! CALL the computational subroutine.
CALL hash(y1,x1,x2,m1,n1,m2,n2)
! Load the data into y_pr1, which is the output to MATLAB.
CALL mxCopyReal8ToPtr(y1,y_pr1,size3)     
END SUBROUTINE
!==================================================================================================================================
!> Computional Subroutine 
!==================================================================================================================================
SUBROUTINE hash(y1,x1,x2,m1,n1,m2,n2)
! MODULES
USE OMP_LIB
! IMPLICIT VARIABLE HANDLING
IMPLICIT NONE
!----------------------------------------------------------------------------------------------------------------------------------
! INPUT / OUTPUT VARIABLES
!----------------------------------------------------------------------------------------------------------------------------------
! LOCAL VARIABLES 
mwSize           :: m1,n1,m2,n2,i
DOUBLE PRECISION :: x1(m1,n1),y1(m1,n2)
INTEGER          :: x2(m2,n2)
!==================================================================================================================================
!$OMP PARALLEL 
!$OMP DO 
DO i=1,n2
    y1(:,i)= x1(:,x2(1,i))
END DO
!$OMP END DO
!$OMP END PARALLEL
END SUBROUTINE

ConvexHull on 11 Aug 2019

Edited: ConvexHull on 11 Aug 2019

Open in MATLAB Online

Thank you Bruno,

i have moved the whole stuff to the mex file. Now there is no big memory copying except filling the last array.

Here are some first results with the mex/openmp implementation with one and with four cores:

1 core mex file:

tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 9.974623 seconds.

4 cores mex file:

tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 12.943373 seconds.

vectorized matlab:

tic
for i=1:1000
... matlab code
end
toc
Elapsed time is 13.259077 seconds.

The mex file looks like:

!$OMP PARALLEL DO NUM_THREADS(4)
DO i=1,M
    ... local operation without memory copying
    uz(1,i) = ...
END DO
!$OMP END PARALLEL DO 

When is add additional dummy work to the loop

!$OMP PARALLEL DO NUM_THREADS(4)
DO i=1,M
    ... local operation without memory copying
    uz(1,i) = ...
    
    DO j=1,1000
    ... dummy work
    END DO
END DO
!$OMP END PARALLEL DO 

the results look like

1 core mex file + dummy work:

tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 51.427996 seconds.

4 cores mex file + dummy work:

tic
for i=1:1000
u_mex=hash(a_1,a_2,a_3,a_4,a_5,a_6,a_7);
end
toc
Elapsed time is 18.105696 seconds.

Good thing is now:

memory copying is gone
openmp + mex + fortran works

Bad thing is now:

the work load in my openmp loop is too small
the overhead of openmp (open, close, sync, ...) is the crucial part which limits the speed/scaling

Note that the sizes of the arrays were again N=19012 and M=130000, which is the average order of magnitude excepted.

This is a little bit frustrating! :)

More suggestions?

ConvexHull on 11 Aug 2019

Thank you Bruno,

i already have optimized the fortran code for single performance. There's nothing more to get.

Regards

ConvexHull on 11 Aug 2019

Edited: ConvexHull on 11 Aug 2019

Open in MATLAB Online

Back to the roots. The question is really simple.

We have a fixed list of data B and a list of ids C which may vary in size and content for each call.

A=B(:,:,C(:));

How to handle such stuff in Matlab/C/Fortran with or without openmp in context of speed/scaling for a medium size problem?

Sign in to comment.

Sign in to answer this question.

Answer 1

Bruno Luong on 11 Aug 2019

1
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/475614-scaling-of-vectorized-code-which-is-limited-by-memory-access#answer_387146

Edited: Bruno Luong on 11 Aug 2019

Open in MATLAB Online

Here is my C-mex implementation, on my laptop it accelerates by 20 fold from your original MATLAB code (compiled with MSVS 2017, 1 thread)

N=19012;
M=130000;
u_xi = rand(1,M);
u_eta = rand(1,M);
a_I = ceil(rand(1,M)*N);
var_z_analyse = rand(3,3,N);
tic
for i=1:1000
    V_x = zeros(3,1,M);
    V_y = zeros(1,3,M);
    A = var_z_analyse(:,:,a_I);
    V_x(2,1,:) = u_xi(:);
    V_x(1,1,:) = 1;
    V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
    V_y(1,2,:) = u_eta(:);
    V_y(1,1,:) = 1;
    V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
    outer_product = bsxfun(@times,V_x,V_y);
    u_z = sum(sum(outer_product.*A,1),2);
end
toc % 21.871430 seconds.
tic
for i=1:1000
    u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
end
toc % 1.057965 seconds.
norm(u_z_mex(:) - u_z(:)) / norm(u_z(:)) % 7.5515e-17

The hash.c C-mex code (quick and dirty without checking correctness of the input)

/*************************************************
 * CALLING:
 * >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
 * COMPILING:
 * >> mex -O hash.c 
 ************************************************/
#include "mex.h"
#include "matrix.h"
#define XI              prhs[0]
#define ETA             prhs[1]
#define VAR_Z_ANALYSE   prhs[2]
#define A_I             prhs[3]
#define UZ              plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
    int i,j,k,M;
    double xi_k_pow[3], eta_k, xi_k, etap, s, ss;
    double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
    
    var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
    u_eta = (double*)mxGetData(ETA);
    u_xi = (double*)mxGetData(XI);
    a_I = (double*)mxGetData(A_I);
    M = (int)mxGetNumberOfElements(A_I);
    
    UZ = mxCreateDoubleMatrix(1,M,mxREAL);
    uz = (double*)mxGetData(UZ);
    
    xi_k_pow[0] = 1;
    for (k=M; k--;)
    {
        Ak = var_z_analyse + 9*((int)(*a_I++)-1);
        
        eta_k = *u_eta++;
        
        xi_k = *u_xi++;
        xi_k_pow[1] = xi_k;
        xi_k_pow[2] = xi_k*xi_k;
        etap  = 1.0;
        s = 0.0;
        for (i=3; i--;)
        {
            ss = 0.0;
            for (j=0; j<3; j++) ss += xi_k_pow[j] * *(Ak++);
            s += etap * ss;
            etap *= eta_k;
        }
        *uz++ = s;
    }
}

6 Comments
Show 4 older commentsHide 4 older comments

Bruno Luong on 11 Aug 2019

Edited: Bruno Luong on 11 Aug 2019

Open in MATLAB Online

"I removed the zeros statement, because you dont need it in the matlab implementation and it makes the code 30% slower. "

Can't agree. Of course for the for-loop, 2nd iteration the V_x and V_y is already allocated you don't see the befefit of zeros(), but if you clear the variable V_x, V_y after using it it will become slower, since the arrays are growed when assign the third row/column .

N=19012;
M=130000;
u_xi = rand(1,M);
u_eta = rand(1,M);
a_I = ceil(rand(1,M)*N);
var_z_analyse = rand(3,3,N);
tic
for i=1:1000
    V_x = zeros(3,1,M);
    V_y = zeros(1,3,M);
    A = var_z_analyse(:,:,a_I);
    V_x(2,1,:) = u_xi(:);
    V_x(1,1,:) = 1;
    V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
    V_y(1,2,:) = u_eta(:);
    V_y(1,1,:) = 1;
    V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
    outer_product = bsxfun(@times,V_x,V_y);
    u_z = sum(sum(outer_product.*A,1),2);
end
toc % 21.665071 seconds.
tic
for i=1:1000
    A = var_z_analyse(:,:,a_I);
    V_x(2,1,:) = u_xi(:);
    V_x(1,1,:) = 1;
    V_x(3,1,:) = V_x(2,1,:).*V_x(2,1,:);
    V_y(1,2,:) = u_eta(:);
    V_y(1,1,:) = 1;
    V_y(1,3,:) = V_y(1,2,:).*V_y(1,2,:);
    outer_product = bsxfun(@times,V_x,V_y);
    u_z = sum(sum(outer_product.*A,1),2);
    clear V_x V_y % CLEAR takes normally very little time
end
toc % 26.225464 seconds.

Bruno Luong on 11 Aug 2019

Open in MATLAB Online

Horner's evaluation variant, according to my test it's even faster (about 10%)

/*************************************************
 * CALLING:
 * >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
 * COMPILING:
 * >> mex -O hash.c 
 ************************************************/
#include "mex.h"
#include "matrix.h"
#define XI              prhs[0]
#define ETA             prhs[1]
#define VAR_Z_ANALYSE   prhs[2]
#define A_I             prhs[3]
#define UZ              plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
    int i,j,k,M;
    double eta_k, xi_k, s, ss;
    double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
    
    var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
    u_eta = (double*)mxGetData(ETA);
    u_xi = (double*)mxGetData(XI);
    a_I = (double*)mxGetData(A_I);
    M = (int)mxGetNumberOfElements(A_I);
    
    UZ = mxCreateDoubleMatrix(1,M,mxREAL);
    uz = (double*)mxGetData(UZ);
    
    for (k=M; k--;)
    {
        Ak = var_z_analyse + 9*(int)(*a_I++);
        
        eta_k = *u_eta++;
        xi_k = *u_xi++;
        s = 0.0;
        for (i=3; i--;)
        {
            ss = *(--Ak);
            for (j=2; j--;) ss = ss*xi_k + *(--Ak);
            s = s*eta_k + ss;
        }
        *uz++ = s;
    }
}

Bruno Luong on 11 Aug 2019

Open in MATLAB Online

Folding for-loop variant code, 10% faster still

/*************************************************
 * CALLING:
 * >> u_z_mex = hash(u_xi,u_eta,var_z_analyse,a_I);
 * COMPILING:
 * >> mex -O hash.c 
 ************************************************/
#include "mex.h"
#include "matrix.h"
#define XI              prhs[0]
#define ETA             prhs[1]
#define VAR_Z_ANALYSE   prhs[2]
#define A_I             prhs[3]
#define UZ              plhs[0]
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
    int k, M;
    double eta_k, xi_k, s, ss;
    double *Ak, *u_eta, *u_xi, *a_I, *uz, *var_z_analyse;
    
    var_z_analyse = (double*)mxGetData(VAR_Z_ANALYSE);
    u_eta = (double*)mxGetData(ETA);
    u_xi = (double*)mxGetData(XI);
    a_I = (double*)mxGetData(A_I);
    M = (int)mxGetNumberOfElements(A_I);
    
    UZ = mxCreateDoubleMatrix(1,M,mxREAL);
    uz = (double*)mxGetData(UZ);
    
    for (k=M; k--;)
    {
        Ak = var_z_analyse + 9*(int)(*a_I++);      
        eta_k = *u_eta++;
        xi_k = *u_xi++;
        s = *(--Ak);
        s = s*xi_k + *(--Ak);
        s = s*xi_k + *(--Ak);
        ss = *(--Ak);
        ss = ss*xi_k + *(--Ak);
        ss = ss*xi_k + *(--Ak);
        s = s*eta_k + ss;
        ss = *(--Ak);
        ss = ss*xi_k + *(--Ak);
        ss = ss*xi_k + *(--Ak);
        *uz++ = s*eta_k + ss;
    }
}

ConvexHull on 12 Aug 2019

Well done!

Sign in to comment.

Scaling of vectorized code which is limited by memory access

11 Comments
Show 9 older commentsHide 9 older comments

Accepted Answer

6 Comments
Show 4 older commentsHide 4 older comments

More Answers (0)

See Also

Categories

Tags

Community Treasure Hunt

Scaling of vectorized code which is limited by memory access

11 Comments Show 9 older commentsHide 9 older comments

Accepted Answer

6 Comments Show 4 older commentsHide 4 older comments

More Answers (0)

See Also

Categories

Tags

Community Treasure Hunt

11 Comments
Show 9 older commentsHide 9 older comments

6 Comments
Show 4 older commentsHide 4 older comments