removing specified data from variable

C.G. on 7 Nov 2022
Commented: Davide Masiello on 7 Nov 2022
I have a 100x2 dataset I am working with. I also have 2 random distributions of data.
I want to modify my original dataset in the following way:
  1. Randomly pick a number from random distribution 1 and keep that many rows of the data.
  2. Randomly pick a number from random distribution 2 and remove that many of the following rows.
I want to repeat these two steps, one after the other, over the full length of the dataset.
Can anybody help me define this?
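time = 1:100;
var = rand(100,1);
data = [time' var]; % dataset
dist1 = 1 + (20-1).*rand(100,1); % random distribution 1 (values between 1 and 20)
dist2 = 10 + (30-10).*rand(100,1); % random distribution 2 (values between 10 and 30)
position1 = randi(length(dist1)); % random index into dist1
card1 = dist1(position1); % number of rows to keep
position2 = randi(length(dist2)); % random index into dist2
card2 = dist2(position2); % number of rows to remove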
  8 Comments
Davide Masiello on 7 Nov 2022
Is there a maximum amount of rows that are kept or removed each time?
C.G. on 7 Nov 2022
It all depends on the number that is drawn from the random distribution each time. The maximum number of rows that could be kept or removed is the maximum value in dist1 or dist2, respectively.


Answers (2)

Davide Masiello on 7 Nov 2022
Edited: Davide Masiello on 7 Nov 2022
I think the following code is a simpler way of achieving your task. At each iteration it pulls a random value from distribution1 (odd iterations) or distribution2 (even iterations), and that value is the number of rows to either keep or remove.
Both vectors hold random integers between 1 and 100 here, and the pulled value is capped at the number of rows still remaining in the dataset.
See below the code with printed text describing the action at each iteration.
data = [(1:100)' rand(100,1)] % Dataset
data = 100×2
    1.0000    0.8919
    2.0000    0.9869
    3.0000    0.7180
    4.0000    0.3318
    5.0000    0.7077
    6.0000    0.9559
    7.0000    0.8753
    8.0000    0.0893
    9.0000    0.9560
   10.0000    0.7366
datanew = [];
distribution1 = randi(100,100,1); % Array of random integers (to be replaced with gaussian distribution later)
distribution2 = randi(100,100,1); % Array of random integers (to be replaced with gaussian distribution later)
index = 0;
iter = 1;
while index < size(data,1)
    fprintf('This is iteration number %d.\n',iter)
    if isequal(mod(iter,2),1)
        % Odd iteration: keep the next block of rows
        increment = min(distribution1(randi(length(distribution1),1,1)),size(data,1)-index);
        fprintf('The random number is %d.\n',increment)
        fprintf('We keep the rows between %d and %d.\n',[index+1,index+increment])
        datanew = [datanew;data(index+1:index+increment,:)];
    else
        % Even iteration: skip the next block of rows
        increment = min(distribution2(randi(length(distribution2),1,1)),size(data,1)-index);
        fprintf('The random number is %d.\n',increment)
        fprintf('The rows between %d and %d do not get added to the new dataset.\n',[index+1,index+increment])
    end
    iter = iter+1;
    index = index+increment;
end
This is iteration number 1.
The random number is 3.
We keep the rows between 1 and 3.
This is iteration number 2.
The random number is 52.
The rows between 4 and 55 do not get added to the new dataset.
This is iteration number 3.
The random number is 45.
We keep the rows between 56 and 100.
size(data)
ans = 1×2
100 2
size(datanew)
ans = 1×2
48 2
  5 Comments
Davide Masiello on 7 Nov 2022
But why do you first generate a random distribution and then randomly take a value from it?
How is this different from just generating a random number?
I.e., how is this
distribution1 = randi(10,100,1); % array of 100 random integers (max val. = 10)
a = distribution1(randi(100,1,1)) % integer randomly pulled from distribution 1
a = 7
different from this
a = randi(10,1,1) % random integer between 1 and 10
a = 2
Davide Masiello on 7 Nov 2022
OK, I see now. Sorry, I must have skipped that part.
I have modified my answer so that the number of rows to keep/remove is pulled randomly from the vectors which I called distribution1 and distribution2.
These are vectors of random integers; you can replace them with the Gaussian distributions at your discretion.



C.G. on 7 Nov 2022
Edited: C.G. on 7 Nov 2022
As I said, I eventually want to choose from a Gaussian distribution. This would mean I would get a large number of values in the middle of the range and few occurrences of very small or very large numbers. Currently, by just generating a uniform random distribution, it could give me any number at any time.
I'm currently just trying to get the code working and pulling numbers from the right places; I can change the distributions from there. I apologise if this has been confusing.
Eventually, I want 2 distributions, e.g.
dist1 = 1 + (10-1).*rand(100,1); % small numbers (between 1 and 10)
dist2 = 50 + (100-50).*rand(100,1); % big numbers (between 50 and 100)
Then the number of rows to keep would be selected from the small-number distribution, and the number of rows to get rid of would be selected from the big-number distribution.
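One possible way to get the Gaussian behaviour described above is to build each vector with randn, then round and clamp the values so every entry is a valid whole number of rows. This is only a sketch: the means, standard deviations and bounds below are assumed values chosen for illustration, not numbers given in this thread.

```matlab
n = 100;                                 % number of samples per distribution
mu1 = 5;  sigma1 = 2;                    % assumed mean/std for the "small" counts
mu2 = 75; sigma2 = 10;                   % assumed mean/std for the "big" counts

% Draw normally distributed values, round to whole rows, clamp to assumed bounds
dist1 = min(max(round(mu1 + sigma1.*randn(n,1)), 1), 10);    % integers in [1, 10]
dist2 = min(max(round(mu2 + sigma2.*randn(n,1)), 50), 100);  % integers in [50, 100]

% Pull one value from each vector, as in the original question
card1 = dist1(randi(numel(dist1)));      % number of rows to keep
card2 = dist2(randi(numel(dist2)));      % number of rows to remove
```

Vectors built this way could stand in for distribution1 and distribution2 in the loop from the answer. Note that clamping piles a little extra probability onto the bounds, so the result is only approximately Gaussian near the edges.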
