Divide a data set into 4 parts so that the sum of each part is 1/4th of the total

I want to divide a data set into four groups such that the sum of the elements of each group is approximately the same.
For example: [10, 5, 1, 20, 5, 22, 4, 15]
For the above data set, the sum of all the elements is 82.
So I want this data set to be divided into 4 groups such that the sum of the elements of each group is almost the same (about 82/4 = 20.5).
One such possibility is
Set 1: 10, 5, 4, 1
Set 2: 20
Set 3: 22
Set 4: 15, 5
How do I set this up?

Accepted Answer

Image Analyst on 16 Jun 2019
Edited: Image Analyst on 16 Jun 2019
I'd just sort them and then take the CDF and look for percentages:
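c = cumsum(sort(data, 'ascend')); % Cumulative sum of the sorted values
c = c / c(end); % Normalize from 0 to 1
% Index of the first element whose cumulative fraction exceeds each cutoff
c25 = find(c>0.25, 1, 'first');
c50 = find(c>0.5, 1, 'first');
c75 = find(c>0.75, 1, 'first');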
At least that's one way that might work, though it would work best for lots of data rather than just a few elements like you have.
  4 Comments
Nagendra Reddy on 16 Jun 2019
Edited: Nagendra Reddy on 16 Jun 2019
I am really not sure which 4 sets your code is suggesting. Could you please tell me?
If I am not wrong, it is suggesting the following 3 sets:
1, 4, 5, 5, 10
15, 20
22
Image Analyst on 16 Jun 2019
Try this:
data = [10, 5, 1, 20, 5, 22, 4, 15]
sortedc = sort(data, 'ascend');
c = cumsum(sortedc);
c = c / c(end); % Normalize from 0 to 1
% Index of the last element whose cumulative fraction is still below each cutoff
c25 = find(c < 0.25, 1, 'last')
c50 = find(c < 0.5, 1, 'last')
c75 = find(c < 0.75, 1, 'last')
% Split the sorted data at those indexes
group1 = sortedc(1:c25);
group2 = sortedc(c25+1:c50);
group3 = sortedc(c50+1:c75);
group4 = sortedc(c75+1:end);
sumOfGroup1 = sum(group1)
sumOfGroup2 = sum(group2)
sumOfGroup3 = sum(group3)
sumOfGroup4 = sum(group4)
fprintf('The sum of group 1 is %d = %.5f%%\n', sumOfGroup1, 100 * sumOfGroup1 / sum(sortedc));
fprintf('The sum of group 2 is %d = %.5f%%\n', sumOfGroup2, 100 * sumOfGroup2 / sum(sortedc));
fprintf('The sum of group 3 is %d = %.5f%%\n', sumOfGroup3, 100 * sumOfGroup3 / sum(sortedc));
fprintf('The sum of group 4 is %d = %.5f%%\n', sumOfGroup4, 100 * sumOfGroup4 / sum(sortedc));
You get
group1 =
1 4 5 5
group2 =
10 15
group3 =
20
group4 =
22
The sum of group 1 is 15 = 18.29268%
The sum of group 2 is 25 = 30.48780%
The sum of group 3 is 20 = 24.39024%
The sum of group 4 is 22 = 26.82927%
For a much larger set, though, the groups come out much closer to 25% each:
numElements = 100000;
maxValue = 99;
data = randi(maxValue, 1, numElements);
sortedc = sort(data, 'ascend');
c = cumsum(sortedc);
c = c / c(end); % Normalize from 0 to 1
c25 = find(c < 0.25, 1, 'last')
c50 = find(c < 0.5, 1, 'last')
c75 = find(c < 0.75, 1, 'last')
group1 = sortedc(1:c25);
group2 = sortedc(c25+1:c50);
group3 = sortedc(c50+1:c75);
group4 = sortedc(c75+1:end);
sumOfGroup1 = sum(group1)
sumOfGroup2 = sum(group2)
sumOfGroup3 = sum(group3)
sumOfGroup4 = sum(group4)
fprintf('The sum of group 1 is %d = %.5f%%\n', sumOfGroup1, 100 * sumOfGroup1 / sum(sortedc));
fprintf('The sum of group 2 is %d = %.5f%%\n', sumOfGroup2, 100 * sumOfGroup2 / sum(sortedc));
fprintf('The sum of group 3 is %d = %.5f%%\n', sumOfGroup3, 100 * sumOfGroup3 / sum(sortedc));
fprintf('The sum of group 4 is %d = %.5f%%\n', sumOfGroup4, 100 * sumOfGroup4 / sum(sortedc));
A typical run prints something like this (the exact numbers vary because the data are random):
The sum of group 1 is 1250676 = 24.99972%
The sum of group 2 is 1250679 = 24.99978%
The sum of group 3 is 1250651 = 24.99922%
The sum of group 4 is 1250755 = 25.00129%
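The two scripts above differ only in the data, so the same CDF idea can be written once for any number of groups. This is only a minimal sketch under my own assumptions: the names numGroups, lastIdx, firstIdx, groups and groupSums are not from the code above, and nnz is used in place of find(..., 'last') so that a cutoff with no elements below it gives 0 rather than an empty index. It also assumes the data are non-negative, so the cumulative sum never decreases.

```matlab
% Sketch: CDF-based split into an arbitrary number of groups.
data = [10, 5, 1, 20, 5, 22, 4, 15];
numGroups = 4;                             % assumed name: how many groups to make
sortedData = sort(data, 'ascend');
c = cumsum(sortedData) / sum(sortedData);  % normalized cumulative sum, 0 to 1

% lastIdx(k) = index of the last sorted element that belongs to group k.
% Because c never decreases, counting the elements below the cutoff
% gives the index of the last one below it (or 0 if there is none).
lastIdx = zeros(1, numGroups);
for k = 1 : numGroups - 1
    lastIdx(k) = nnz(c < k / numGroups);
end
lastIdx(numGroups) = numel(sortedData);    % the last group takes the rest
firstIdx = [1, lastIdx(1:end-1) + 1];

groups = cell(1, numGroups);
groupSums = zeros(1, numGroups);
for k = 1 : numGroups
    groups{k} = sortedData(firstIdx(k) : lastIdx(k));
    groupSums(k) = sum(groups{k});
    fprintf('The sum of group %d is %d = %.5f%%\n', ...
        k, groupSums(k), 100 * groupSums(k) / sum(sortedData));
end
```

With the 8-element example this gives the same four groups as the script above (sums 15, 25, 20 and 22). Swapping in the randi data only requires changing the first line.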
If the CDF method is not accurate enough for your small groups, then I think one approach you might take is to try every single permutation and check which one has group sums with the smallest average absolute deviation from 25% of the total. I don't have code for that and probably won't write any. I'm assuming you gave a very small data set just as a simple example and that your actual data is much larger. Good luck.
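For anyone who wants to try that exhaustive idea on a small set, here is a minimal sketch. It is my own reading of the suggestion, not code from the answer: "every permutation" is interpreted as every possible assignment of each element to one of the groups, and the score is the mean absolute deviation of the group sums from one quarter of the total. The names numGroups, labels, bestLabels and bestDeviation are assumptions. The number of assignments is numGroups^n (65536 for 8 elements), so this is only practical for a handful of elements.

```matlab
% Sketch: exhaustive search over all assignments of elements to groups.
data = [10, 5, 1, 20, 5, 22, 4, 15];
numGroups = 4;
n = numel(data);
target = sum(data) / numGroups;        % ideal sum of each group

bestDeviation = inf;
bestLabels = [];
for code = 0 : numGroups^n - 1
    % Read "code" as an n-digit number in base numGroups;
    % digit i (plus 1) is the group label of element i.
    labels = zeros(1, n);
    r = code;
    for i = 1 : n
        labels(i) = mod(r, numGroups) + 1;
        r = floor(r / numGroups);
    end
    % Sum the elements that share a label (empty groups get a sum of 0).
    groupSums = accumarray(labels(:), data(:), [numGroups, 1]);
    deviation = mean(abs(groupSums - target));
    if deviation < bestDeviation
        bestDeviation = deviation;
        bestLabels = labels;
    end
end

bestSums = accumarray(bestLabels(:), data(:), [numGroups, 1]);
for k = 1 : numGroups
    fprintf('Group %d: %s (sum %d)\n', k, mat2str(data(bestLabels == k)), bestSums(k));
end
```

For the example data the best achievable group sums are 20, 20, 20 and 22, which is what the split proposed in the question already achieves. Several assignments tie for that score, and the loop simply keeps the first one it finds.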
