Calculating group statistics using frequency weights

10 views (last 30 days)
I have a large table of survey data with about 3 Mio. observations and 120 variables. The survey also contains a variable called "weights" with integers between 250 and 250'000. These weights are intended to render the sample representative. Hence I have to weight every observation for every calculation. I need a number of (weighted) group means, medians and percentiles.
How can I calculate summary statistics and group statistics, weighting the observations with the frequency weights?
Considering the size of the dataset, I am looking for a solution that works for a large dataset of dimensions I mentioned above. (If I needed the unweighted statistics, it can easily and efficiently be done with the groupsummary() function, I have however not found an option allowing for the weights.)
Thanks for your help
Chris
  5 Comments
clauper
clauper on 17 Dec 2019
Thanks for your comment. I have updated the question to give more context.
Adam Danz
Adam Danz on 18 Dec 2019
I added an answer that shows how to scale your inputs according to your weights.

Sign in to comment.

Answers (1)

Adam Danz
Adam Danz on 17 Dec 2019
Since you integer weights are much too large to merely replicate values based on the weights, you can scale your data according to your weights. This is only 1 of many interpretations of applying weights.
There are several ways around this and the best method depends on how you're using the weights, what those weights mean, and the bounds of those weights. That's something you'll need to think about.
Here's my proposal.
% Create demo data
x = randi(10,1,20); % Main data: 20 random integers
w = randi(24750,size(x))+250; % random weights between 250 and 250000
This is the part you'll need to consider. The idea is to normalize your weights between [0,1] but keep in mind that a weight of 0 will completely eliminate a value.
% Normalize the weights
% If you know the upper and lower limits of the weights (Safer than alternative)
knownWeightBounds = [250,250000];
% Or maybe use
knownWeightBounds = [0,250000];
% or maybe
knownWeightBounds = [min(w), max(w)];
% Scale the weights to 0:1
wNorm = (w-knownWeightBounds(1))/range(knownWeightBounds);
Now scale your data according to the weights.
% Scale the data by normalized weights.
xScaled = x .* wNorm;
% compute whatever stats you want on xScaled
mu = mean(xScaled)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!