Is my mathematical problem to hard for single GPU computation?

2 views (last 30 days)
I am trying to solve mathematically interesting problem from X-ray phase contrast imaging or X-ray grating interferometry also known as X-ray Talbot inteferometry. I have M=5 images for foreground and M=5 images for background. In each pixel, I have M=5 intensity values for foreground and M=5 intensity values for background. These M=5 intensity values for each pixel lies on the sinus curve. I am trying to fit these sinus curves for each pixel with Levenberg-Marquardt non-linear sinusoidal curve fit. I wrote simple algorithm or function to do that called levmar_sinus. So, I am not using Levenberg-Marquardt curve fit algorithm from Matlab Optimization toolbox. I do fitting of sinusoidal curve for each pixel separately for foreground and background. From fitting, I know phase value for each pixel for foreground image and background image. By subtracting phase values for background image from phase values for foreground image, I get final so-called differentail phase image, which is my result. I tried to run this program only on the CPU (i7-8750H CPU @ 2.20GHz), and it took 300 seconds to do calculation. I posted this algorithm on the file exchange under my name. It took 300 seconds on the CPU, because it is 1536 x 512 pixels = 786432 pixels. A lot of pixels. I re-wrote the algorithm for GPU. The problem presented here is the ideal problem for paralleization using GPU, beacuse you can run sinusoidal fit for each pixel on the different processor of GPU card. Please see algorith below, and it was computational disaster. It took to calculate forground image for 55541 seconds and background image for 55432 seconds. Together almost 31 hours. I used parfor loops to do paralle comuping on the GPU, as below. My GPU card is NVIDIA GeForce GTX 1080 Ti. Not so bad GPU. My question is. Do I have any problem or mistake in my algorithm? Or this mathematical problem, which I present here is so computational exhausted that The CPU computation is better choice over the GPU computation. Even if my problem seems to be ideal for parallel computation. Here, I used only single GPU, so, number of workers was 1. It would be nice if somebody from Mathworks staff could write me his or her opinion about my problem.
%**************************************************************************
% this program opens M=5 images for forground and background
% measured by X-ray talbot interferometry
% and calculate differential phase (dph) image
% using sinusoidal Levenberg-Marquardt algorithm fit by sinus function
% on the GPU
% Author: Karol Vegso
% Affiliation: Institute of Physics, Slovak Academy of Sciences
%**************************************************************************
clear all
close all
%**************************************************************************
% path to folder with forground images
path_to_fg_images='d:\Matlab\Xray_Talbot_Interferometry\XTI_dph_image_calculator_sinus_fitting_GPU\';
% path to background images
path_to_bg_images='d:\Matlab\Xray_Talbot_Interferometry\XTI_dph_image_calculator_sinus_fitting_GPU\';
% path to output folder to save final differentila phase image
path_to_output_folder='d:\Matlab\Xray_Talbot_Interferometry\XTI_dph_image_calculator_sinus_fitting_GPU\final_image\';
% number of images in fringe scanning technique
% the same number for forground and background
M=5;
% open forground images and store them in 3D matrix
% root image name for forground
root_image_name_fg='fg';
% root image name for background
root_image_name_bg='bg';
% number of digits in image numbering
number_digits=6;
% size of image
% size of image in horizontal direction, number of columns
image_size_cols=1536;
% size of image in vertical direction, number of rows
image_size_rows=512;
% image size
image_size=image_size_cols*image_size_rows;
% precision
% here, it is unsigned 16 bit integer
precision='uint16';
% order of reading bytes
order_read_bytes='ieee-le';
% create image buffer to store single forground image
image_buffer_fg=zeros(image_size, M);
% process M forground images
for index_0=1:M
% create image_number as string
image_number=num2str(index_0);
% add number digits
image_number=pad(image_number, number_digits, 'left');
% replace empty spaces with zeros
image_number=replace(image_number,' ', '0');
% full path to forground image
path_to_fg_image=strcat(path_to_fg_images, root_image_name_fg, image_number,'.raw');
% create fileID
fileID=fopen(path_to_fg_image);
% read image
image_buffer_fg(:,index_0)=fread(fileID, image_size, precision, order_read_bytes);
% close fileID
fclose(fileID);
end
%**************************************************************************
% create image buffer to store single background image
image_buffer_bg=zeros(image_size, M);
% process M background images
for index_0=1:M
% create image_number as string
image_number=num2str(index_0);
% add number digits
image_number=pad(image_number, number_digits, 'left');
% replace empty spaces with zeros
image_number=replace(image_number,' ', '0');
% full path to background image
path_to_bg_image=strcat(path_to_bg_images, root_image_name_bg, image_number,'.raw');
% create fileID
fileID=fopen(path_to_bg_image);
% read image
image_buffer_bg(:,index_0)=fread(fileID, image_size, precision, order_read_bytes);
% close fileID
fclose(fileID);
end
%**************************************************************************
% identify number of available GPUs
availableGPUs=gpuDeviceCount('available');
% create parallel pool on available GPU
parpool('Processes',availableGPUs);
%**************************************************************************
% send forground and backround image data from phase-stepping to GPU
% send foreground image data to GPU
image_buffer_fg_GPU=gpuArray(image_buffer_fg(:,:));
% send background image data to GPU
image_buffer_bg_GPU=gpuArray(image_buffer_bg(:,:));
%**************************************************************************
% calculate differential phase image
% define phase step
phase_step=(2*pi)/M;
% generate phase coordinates for sinusidal fitting
phase=zeros(1,M);
for index_0=1:M
phase(1,index_0)=(index_0-1)*phase_step;
end
%**************************************************************************
% send phase data vector to the GPU
phase_GPU=gpuArray(phase(1,:));
%**************************************************************************
% phase image for forground initialized on the GPU
phase_image_fg_GPU=zeros(image_size,1,"gpuArray");
% phase image for background initialized on the GPU
phase_image_bg_GPU=zeros(image_size,1,"gpuArray");
% initiliaze differential phase on the GPU
dph_image_GPU=zeros(image_size,1,"gpuArray");
%**************************************************************************
% perform sinusiodal fit on the foreground data
%**************************************************************************
% maximum number of iterations
k_max=1000;
% transfer k_max to the GPU
k_max_GPU=gpuArray(k_max);
% epsilon 1
epsilon_1=1.0e-8;
% transfer epsilon_1 to the GPU
epsilon_1_GPU=gpuArray(epsilon_1);
% epsilon 2
epsilon_2=1.0e-8;
% transfer epsilon_2 to the GPU
epsilon_2_GPU=gpuArray(epsilon_2);
% tau
tau=1.0e-3;
% transfer tau to the GPU
tau_GPU=gpuArray(tau);
%**************************************************************************
% perform sinusiodal fitting pixel by pixel on the foreground data on the
% GPU
%**************************************************************************
tic
parfor index_0=1:image_size
%tic
% print_index_0
%index_0
% initialize y_data
y_data=zeros(1,M);
% initialize fit_result
fit_result=zeros(1,3);
% fill y_data
y_data(1,:)=image_buffer_fg_GPU(index_0,:);
% initial parameters
% amplitude
x01=(max(y_data(1,:))-min(y_data(1,:)))/2;
% phase shift
x02=pi/2;
% offset
x03=mean(y_data(1,:));
% initial vector
x0=[x01 x02 x03];
% fill fit_result
fit_result=levmar_sinus(phase_GPU, y_data, x0, k_max_GPU, epsilon_1_GPU, epsilon_2_GPU, tau_GPU);
%fit_result=arrayfun(@levmar_sinus, phase_GPU, y_data, x0, k_max_GPU, epsilon_1_GPU, epsilon_2_GPU, tau_GPU);
phase_image_fg_GPU(index_0,1)=fit_result(1,2);
%toc
end
toc
%**************************************************************************
% perform sinusiodal fit on the background data
%**************************************************************************
% maximum number of iterations
k_max=1000;
% transfer k_max to the GPU
k_max_GPU=gpuArray(k_max);
% epsilon 1
epsilon_1=1.0e-8;
% transfer epsilon_1 to the GPU
epsilon_1_GPU=gpuArray(epsilon_1);
% epsilon 2
epsilon_2=1.0e-8;
% transfer epsilon_2 to the GPU
epsilon_2_GPU=gpuArray(epsilon_2);
% tau
tau=1.0e-3;
% transfer tau to the GPU
tau_GPU=gpuArray(tau);
%**************************************************************************
% perform sinusiodal fitting pixel by pixel on the background data
%**************************************************************************
tic
parfor index_0=1:image_size
%tic
% print index_0
%index_0
% initialize y_data
y_data=zeros(1,M);
% initialize fit_result
fit_result=zeros(1,3);
% fill y_data
y_data(1,:)=image_buffer_bg_GPU(index_0,:);
% initial parameters
% amplitude
x01=(max(y_data(1,:))-min(y_data(1,:)))/2;
% phase shift
x02=pi/2;
% offset
x03=mean(y_data(1,:));
% initial vector
x0=[x01 x02 x03];
% fill fit_result
fit_result=levmar_sinus(phase_GPU, y_data, x0, k_max_GPU, epsilon_1_GPU, epsilon_2_GPU, tau_GPU);
%fit_result=arrayfun(@levmar_sinus, phase_GPU, y_data, x0, k_max_GPU, epsilon_1_GPU, epsilon_2_GPU, tau_GPU);
phase_image_bg_GPU(index_0,1)=fit_result(1,2);
%toc
end
toc
%**************************************************************************
% calculate final differential phase image
dph_image_GPU=phase_image_fg_GPU(:,1)-phase_image_bg_GPU(:,1);
% transfer differential phase image or dph image from the GPU to the CPU
dph_image_CPU=gather(dph_image_GPU(:,1));
% reshape linear image to the 2D image form
dph_image_CPU_reshaped=reshape(dph_image_CPU(:,1),[image_size_cols,image_size_rows]);
% transpose reshaped image on the CPU
dph_image_CPU_reshaped=transpose(dph_image_CPU_reshaped(:,:));
% show final differential phase image
imshow(dph_image_CPU_reshaped(:,:), []);
% path to final differential phase image or dph image
path_to_final_dph_image=strcat(path_to_output_folder, 'dph.raw');
% create fileID
fileID=fopen(path_to_final_dph_image, 'w');
% save differential phase image as 64bit real image with little bit endian
% ordering
fwrite(fileID,dph_image_CPU,'double', 'ieee-le.l64');
% close fileID
fclose(fileID);
%**************************************************************************
% delete gcp
delete(gcp('nocreate'));
%**************************************************************************
% end of program
%**************************************************************************
%**************************************************************************
% beginning of function for sinusidal fitting
%**************************************************************************
function [x_new] = levmar_sinus(t_data, y_data, x0, k_max, epsilon_1, epsilon_2, tau)
%**************************************************************************
% beginning of algorithm
%**************************************************************************
% initial iteration variable
k=0;
ni=2;
M=length(t_data(1,:));
% define Jacobian matrix
J=zeros(M,3);
% fill Jacobian matrix
J(:,1)=(-1.0)*sin(t_data(1,:)+x0(1,2));
J(:,2)=(-1.0)*x0(1,1)*cos(t_data(1,:)+x0(1,2));
J(:,3)=-1;
% calculate A matrix
A(:,:)=transpose(J(:,:))*J(:,:);
% caclulate function f
f=zeros(1,M);
f=y_data(1,:)-x0(1,1)*sin(t_data(1,:)+x0(1,2))-x0(1,3);
% calculate g
g=transpose(J(:,:))*transpose(f(1,:));
% calculate g norm
g_norm=sqrt(sum(g(:,1).*g(:,1)));
% boolean variable
found_bool=(g_norm <= epsilon_1);
% define mi
mi=tau*max(diag(A(:,:)));
% initialize x vector
x=x0(1,:);
while ((~found_bool) & (k < k_max))
% increase iteration by one
k=k+1;
B=A+mi*eye(3);
h_lm=(-1.0)*transpose(g)*inv(B);
h_lm_norm=sqrt(sum(h_lm(1,:).*h_lm(1,:)));
x_norm=sqrt(sum(x(1,:).*x(1,:)));
if (h_lm_norm <= epsilon_2*(x_norm+epsilon_2))
found_bool=1;
else
x_new=x(1,:)+h_lm(1,:);
% calculate F(x)
% caclulate function f
f=zeros(1,M);
f=y_data(1,:)-x(1,1)*sin(t_data(1,:)+x(1,2))-x(1,3);
F_x=0.5*sum(f(1,:).*f(1,:));
% calculate F(x_new)
f=zeros(1,M);
f=y_data(1,:)-x_new(1,1)*sin(t_data(1,:)+x_new(1,2))-x_new(1,3);
F_x_new=0.5*sum(f(1,:).*f(1,:));
% ro denominator
ro_denominator=0.5*h_lm(1,:)*(mi*transpose(h_lm(1,:))-g(:,1));
% calculate ro - gain ratio
ro=(F_x-F_x_new)/ro_denominator;
if (ro > 0)
x=x_new(1,:);
% define Jacobian matrix
J=zeros(M,3);
% fill Jacobian matrix
J(:,1)=(-1.0)*sin(t_data(1,:)+x(1,2));
J(:,2)=(-1.0)*x(1,1)*cos(t_data(1,:)+x(1,2));
J(:,3)=-1;
% calculate A matrix
A(:,:)=transpose(J(:,:))*J(:,:);
% caclulate function f
f=zeros(1,M);
f=y_data(1,:)-x(1,1)*sin(t_data(1,:)+x(1,2))-x(1,3);
% calculate g
g=transpose(J(:,:))*transpose(f(1,:));
% calculate g norm
g_norm=sqrt(sum(g(:,1).*g(:,1)));
% boolean variable
found_bool=(g_norm <= epsilon_1);
% calculate mi
mi=mi*max([(1/3) (1-(2*ro-1)^3)]);
% define ni
ni=2;
else
% calculate mi
mi=mi*ni;
% calculate ni
ni=2*ni;
end
end
end
%**************************************************************************
% end of algorithm
%**************************************************************************
end
%**************************************************************************
% End of function for sinusiodal fit
%**************************************************************************

Accepted Answer

Walter Roberson
Walter Roberson on 2 Feb 2023
You have one GPU. When you try to use that one GPU from multiple workers, then each time a different worker gets control, it has to reset the GPU .
The problem presented here is the ideal problem for paralleization using GPU, beacuse you can run sinusoidal fit for each pixel on the different processor of GPU card.
It is common for people to believe that each core of an NVIDIA GPU runs independently, but that is not the case. The "cores" for NVIDIA GPUs are not able to decode instructions, or make decisions about what to do. The instruction decoding and control logic is handled by components that NVIDIA refers to as SMs. The 1080 Ti has 28 SMs -- so it can handle up to 28 different tasks simultaneously in theory. However, unless you want to write your own GPU kernels, you cannot command those yourself. The gpuArray and associated layer of Parallel Computing Toolbox takes care of assigning SMs, and is potentially already running several GPU tasks simultaneously.

More Answers (0)

Products


Release

R2021b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!