PARFOR loop is too slow

1 view (last 30 days)
rambiz
rambiz on 25 Jun 2013
Hi all, this is the code I wrote to simulate 8PSK in Rayleigh Channel. It is working and gives the correct results but it is too slow, compared to the case when I use an ordinary for-loop and run the code on only one core. Right now I run the "parfor-loop-code" on 4 cores. Is it because, I use too much memory for the all the {k}-indexed variables? How can I avoid this problem? TYVM
% 1 Transmit Antenna
% Nr Receive Antenna
% 3 bits per Transmission
clc;
clear all;
Nt = 1; % Tx antennas
Nr = 1; % Rx antennas
Ebit_N0_dB =0;
Es_N0_dec =1/Nr*3*10.^(Ebit_N0_dB/10); %Multiply by 3 because Symbol Energy is trice as much as Bit Energy for 3-Bit-Transmission
%divide by 1/Nr to normalize the received signal energy
%in other words every receiver should receive 1/Nr of the total energy received at the receiver side
BER = zeros(1,length(Ebit_N0_dB));
symbol_errors = zeros(1,length(Ebit_N0_dB));
bit_errors = zeros(1,length(Ebit_N0_dB));
% Defining the Gray Coded Constellation
demap =[0 0 0;0 0 1;0 1 0;0 1 1;1 0 0;1 0 1;1 1 0;1 1 1];
% 8PSK Symbols
S000=1;
S001=exp(1j*1*pi/4);
S010=exp(1j*3*pi/4);
S011=exp(1j*2*pi/4);
S100=exp(1j*7*pi/4);
S101=exp(1j*6*pi/4);
S110=exp(1j*4*pi/4);
S111=exp(1j*5*pi/4);
N=10^6;
symbols=0;
xjq=cell(1,N);
argmin=cell(1,N);
tic
parfor K=1:N
symbols=symbols+1
bin_ip{K} = round(rand(1,3)); % Generate Binary Input
%-------------------------------
if bin_ip{K}==[0 0 0]
xjq{K}=S000
elseif bin_ip{K}==[0 0 1]
xjq{K}=S001
elseif bin_ip{K}==[0 1 0]
xjq{K}=S010
elseif bin_ip{K}==[0 1 1]
xjq{K}=S011
elseif bin_ip{K}==[1 0 0]
xjq{K}=S100
elseif bin_ip{K}==[1 0 1]
xjq{K}=S101
elseif bin_ip{K}==[1 1 0]
xjq{K}=S110
elseif bin_ip{K}==[1 1 1]
xjq{K}=S111
end
n{K} = (1/sqrt(2))*(randn(Nr,1) + 1i*randn(Nr,1)); % white Gaussian Noise with variance of 1
%==================================================================
%Generate Channel Matrix
%==================================================================
H{K}=(1/sqrt(2)*( randn(Nr,Nt)+1i*randn(Nr,Nt)));
y{K} = sqrt(Es_N0_dec)*H{K}*xjq{K} + n{K}; %=================================================================
%Optimal Detection Scheme
=================================================================
p = sqrt(Es_N0_dec); %8PSK
y000{K}=H{K}*p*S000; argmin{K}(1)=norm(y{K}-y000{K});
y001{K}=H{K}*p*S001; argmin{K}(2)=norm(y{K}-y001{K});
y010{K}=H{K}*p*S010; argmin{K}(3)=norm(y{K}-y010{K});
y011{K}=H{K}*p*S011; argmin{K}(4)=norm(y{K}-y011{K});
y100{K}=H{K}*p*S100; argmin{K}(5)=norm(y{K}-y100{K});
y101{K}=H{K}*p*S101; argmin{K}(6)=norm(y{K}-y101{K});
y110{K}=H{K}*p*S110; argmin{K}(7)=norm(y{K}-y110{K});
y111{K}=H{K}*p*S111; argmin{K}(8)=norm(y{K}-y111{K});
[~,est(K)] = min(argmin{K});
est_bin{K} = demap(est(K),:);
%Count bit errors
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip{K}));
end
toc
BER = bit_errors/(symbols*3)

Answers (2)

Matt J
Matt J on 25 Jun 2013
Edited: Matt J on 25 Jun 2013
You have a lot of on-the-fly memory allocation, which probably isn't helping things. H{k},y{k}, y000{k},etc... are not pre-allocated and get generated element-by-element in the loop. The contents of each argmin{K}, apparently a length 8 vector, are also not pre-allocated. Finally, it doesn't seem to make much sense holding these things in cell arrays. It looks like they all contain same-sized quantities and could be held instead in numeric arrays.
As another miscellaneous note, you can make things faster by skipping calls to find(),
bit_errors = bit_errors + sum( est_bin{K}~=bin_ip{K} );
  3 Comments
Matt J
Matt J on 25 Jun 2013
sum(.) is definitely slower!
No way.
x=rand(1e8,1)>.5;
tic;
sum(x);
toc
%Elapsed time is 0.053452 seconds.
tic;
length(find(x));
toc
%Elapsed time is 0.904927 seconds.
rambiz
rambiz on 21 Jul 2013
Edited: rambiz on 21 Jul 2013
Hi guys, tyvm for your comments so far. OK, I changed the original code, taking Matt's suggestions into consideration, i.e. I preallocated memory and used normal arrays instead of cells. The code runs in about 105 seconds (for 10^5 runs) and it is still very slow. Using sum(.) or length(find(.)) doesn't change the speed considerably. Any new comments, please?!
% Nt Transmit Antenna
% Nr Receive Antenna
% 3 bits per Transmission
% clc;
clear all;
tic
Nt = 1; % Tx antennas
Nr = 1; % Rx antennas
Ebit_N0_dB =0;
Es_N0_dec =1/Nr*3*10.^(Ebit_N0_dB/10); %Multiply by 3 because Symbol Energy is trice as much as Bit Energy for 3-Bit-Transmission
%divide by 1/Nr to normalize the received signal energy
%in other words every receiver should receive 1/Nr of the total energy received at the receiver side
BER = zeros(1,length(Ebit_N0_dB));
symbol_errors = zeros(1,length(Ebit_N0_dB));
bit_errors = zeros(1,length(Ebit_N0_dB));
% symbols = zeros(1,length(Ebit_N0_dB));
% Defining the Gray Coded Constellation
demap =[0 0 0;0 0 1;0 1 0;0 1 1;1 0 0;1 0 1;1 1 0;1 1 1];
% % 8PSK Symbols
S000=1;
S001=exp(1j*1*pi/4);
S010=exp(1j*3*pi/4);
S011=exp(1j*2*pi/4);
S100=exp(1j*7*pi/4);
S101=exp(1j*6*pi/4);
S110=exp(1j*4*pi/4);
S111=exp(1j*5*pi/4);
N=5*10^5;
symbols=0;
bin_ip=round(rand(N,3));
xjq=zeros(1,N);
n= (1/sqrt(2))*(randn(Nr,N) + 1i*randn(Nr,N));
H=(1/sqrt(2)*( randn(Nr,Nt,N)+1i*randn(Nr,Nt,N)));
y=zeros(Nr,N);
y000=zeros(Nr,N);
y001=zeros(Nr,N);
y010=zeros(Nr,N);
y011=zeros(Nr,N);
y100=zeros(Nr,N);
y101=zeros(Nr,N);
y110=zeros(Nr,N);
y111=zeros(Nr,N);
argmin=zeros(8,N);
parfor K=1:N
symbols=symbols+1;
%-------------------------------
if bin_ip(K,:)==[0 0 0]
xjq(K)=S000;
elseif bin_ip(K,:)==[0 0 1]
xjq(K)=S001;
elseif bin_ip(K,:)==[0 1 0]
xjq(K)=S010;
elseif bin_ip(K,:)==[0 1 1]
xjq(K)=S011;
elseif bin_ip(K,:)==[1 0 0]
xjq(K)=S100;
elseif bin_ip(K,:)==[1 0 1]
xjq(K)=S101;
elseif bin_ip(K,:)==[1 1 0]
xjq(K)=S110;
elseif bin_ip(K,:)==[1 1 1]
xjq(K)=S111;
end
y(:,K) = sqrt(Es_N0_dec)*H(:,:,K)*xjq(K) + n(:,K); %8PSK
%Optimal Detection Scheme
% =============================================================
p = sqrt(Es_N0_dec); %8PSK
y000(:,K)=H(:,:,K)*p*S000;
y001(:,K)=H(:,:,K)*p*S001;
y010(:,K)=H(:,:,K)*p*S010;
y011(:,K)=H(:,:,K)*p*S011;
y100(:,K)=H(:,:,K)*p*S100;
y101(:,K)=H(:,:,K)*p*S101;
y110(:,K)=H(:,:,K)*p*S110;
y111(:,K)=H(:,:,K)*p*S111;
argmin(:,K)=[norm(y(:,K)-y000(:,K));
norm(y(:,K)-y001(:,K));
norm(y(:,K)-y010(:,K));
norm(y(:,K)-y011(:,K));
norm(y(:,K)-y100(:,K));
norm(y(:,K)-y101(:,K));
norm(y(:,K)-y110(:,K));
norm(y(:,K)-y111(:,K))];
[~,est(K)] = min(argmin(:,K));
est_bin{K} = demap(est(K),:);
%Count bit errors
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip(K,:)));
%bit_errors = bit_errors + sum( est_bin{K}~=bin_ip(K,:));
end
toc
BER = bit_errors/(symbols*3)

Sign in to comment.


Muthu Annamalai
Muthu Annamalai on 25 Jun 2013
Edited: Muthu Annamalai on 25 Jun 2013
Key to using parfor must be the independence of each iteration; i.e. you cannot write variable updates like,
a = a + 1
So change the lines,
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip{K}));
bit_errors(k) = length(find( est_bin{K}~=bin_ip{K}));
and rewrite
BER = bit_errors/(symbols*3)
as
BER = sum(bit_errors)/(symbols*3)
and remove the line,
symbols=symbols+1
because this variable is going to be N at end of all iterations.
This should give you some speedup.
  2 Comments
Matt J
Matt J on 25 Jun 2013
you cannot write variable updates like a=a+1
Why not? Don't they qualify as legitimate reduction assignments as described here?
Muthu Annamalai
Muthu Annamalai on 25 Jun 2013
@Matt I probably stand corrected in that case, while still my approach is explicitly parallelizable.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!