Improve Speed of Matrix Multiplication by Taking Advantage of GPUArray

I want to speed up the multiplication of a matrix A (m-by-n) by a vector x (n-by-1) using gpuArray. Using:
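gpuA = gpuArray(A);   % copy A to GPU memory
gpux = gpuArray(x);   % copy x to GPU memory
gpuC = gpuA * gpux;   % multiply on the GPU
C = gather(gpuC);     % copy the result back to host memory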
didn't give me the needed improvement in speed. Any ideas on how I can achieve this?
Edric Ellis on 19 Jun 2015
How large is A, and which GPU device do you have? Also note that, ideally, you should time only the matrix multiplication itself. Don't include the data transfers in the timing, since in a real workflow you shouldn't need to transfer the data repeatedly. I would compare:
tCpu = timeit(@() A * x)              % CPU time in seconds
tGpu = gputimeit(@() gpuA * gpux)     % GPU time in seconds, transfers excluded
On my K20 GPU I see break-even when A is about 1000x1000.
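To find the break-even point on your own hardware, you could sweep over a few sizes along the following lines. This is only a rough sketch: the list of sizes, the use of square random matrices, and the variable names are assumptions for illustration, and it needs Parallel Computing Toolbox plus a supported GPU.

```matlab
% Sketch: time A*x on the CPU and on the GPU for several matrix sizes.
% The sizes below are arbitrary illustrative values, not recommendations.
sizes = [250 500 1000 2000 4000];
tCpu = zeros(size(sizes));
tGpu = zeros(size(sizes));

for k = 1:numel(sizes)
    n = sizes(k);
    A = rand(n);            % n-by-n test matrix (assumed square here)
    x = rand(n, 1);         % n-by-1 test vector

    % Transfer the data once, outside the timed region.
    gpuA = gpuArray(A);
    gpux = gpuArray(x);

    % Time only the multiplication on each device.
    tCpu(k) = timeit(@() A * x);
    tGpu(k) = gputimeit(@() gpuA * gpux);

    fprintf('n = %5d: CPU %.3g s, GPU %.3g s, CPU/GPU ratio %.2f\n', ...
        n, tCpu(k), tGpu(k), tCpu(k) / tGpu(k));
end
```

A CPU/GPU ratio above 1 means the GPU was faster for that size. The size at which the ratio crosses 1 will depend on your particular CPU and GPU.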

