grpstats

Summary statistics organized by group

Description

statarray = grpstats(tbl,groupvar) returns a table or dataset array containing the means of the groups of data in tbl, where the groups are determined by the values of the grouping variable or variables specified in groupvar.

• If there is a single grouping variable, then there is a row in statarray for each value of the grouping variable. grpstats sorts the groups by order of appearance (if the grouping variable is a character vector or string scalar), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

• If groupvar is a string array or cell array of character vectors containing multiple grouping variable names, or a vector of column numbers, then there is a row in statarray for each observed unique combination of values of the grouping variables. grpstats sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

• If any variables in tbl (other than those specified in groupvar) are not numeric or logical arrays, then you must specify the names or column numbers of the numeric and logical variables for which you want to calculate means using the name-value pair argument, DataVars.
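The DataVars requirement in the last bullet can be sketched as follows. This is a minimal illustration, assuming a small hypothetical table; the variable names Team, Score, Passed, and Name are made up and do not come from any shipped data set.

```matlab
% Hypothetical data for illustration only.
Team   = categorical({'A';'B';'A';'B';'A'});
Score  = [10; 20; 30; 40; 50];
Passed = logical([1; 0; 1; 1; 0]);
Name   = {'v';'w';'x';'y';'z'};      % neither numeric nor logical
tbl    = table(Team,Score,Passed,Name);

% Name cannot be averaged, so list the variables to summarize explicitly.
statarray = grpstats(tbl,'Team','mean','DataVars',{'Score','Passed'});
```

Under these assumptions, statarray should contain one row per level of Team, with the variables Team, GroupCount, mean_Score, and mean_Passed.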

statarray = grpstats(tbl,groupvar,whichstats) returns the group values for the summary statistic types specified in whichstats.

statarray = grpstats(tbl,groupvar,whichstats,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

means = grpstats(X,group) returns a column vector or matrix with the means of the groups of the data in the matrix or vector X determined by the values of the grouping variable or variables, group. The rows of means correspond to the grouping variable values.

• If there is a single grouping variable, then there is a row in means for each value of the grouping variable. grpstats sorts the groups by order of appearance (if the grouping variable is a character vector or string scalar), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

• If group is a string array or cell array of grouping variables, then there is a row in means for each observed unique combination of values of the grouping variables. grpstats sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

• If X is a matrix, then means is a matrix with the same number of columns as X. Each column of means has the group means for the corresponding column of X.
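As a rough sketch of the matrix case, assume a hypothetical 4-by-2 matrix X and a numeric grouping vector. Because the grouping variable is numeric, the rows of means should follow ascending group value rather than order of appearance.

```matlab
% Hypothetical data for illustration only.
X     = [1 10; 2 20; 3 30; 4 40];
group = [2; 1; 2; 1];

% One row per group (sorted ascending), one column per column of X.
means = grpstats(X,group);
% Group 1 uses rows 2 and 4, group 2 uses rows 1 and 3:
% means = [3 30; 2 20]
```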

[stats1,...,statsN] = grpstats(X,group,whichstats) returns column vectors or arrays with group values for the summary statistic types specified in whichstats.

[stats1,...,statsN] = grpstats(X,group,whichstats,'Alpha',alpha) specifies the significance level for confidence and prediction intervals.

grpstats(X,group,alpha) plots the means of the groups of data in the vector or matrix X determined by the values of the grouping variable, group. The grouping variable values are on the horizontal plot axis. Each group mean has 100×(1 – alpha)% confidence intervals.

• If X is a matrix, then grpstats plots the means and confidence intervals for each column of X.

• If group is a cell array of grouping variables, then grpstats plots the means and confidence intervals for the groups of data in X determined by the unique combinations of values of the grouping variables. For example, if there are two grouping variables, each with two values, there are four possible combinations of grouping variable values. The plot includes only the combinations of values that exist in the input grouping variables (not all possible combinations).

Examples

Load the sample data.

load hospital

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Sex, Age, Weight, and Smoker.

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean for the numeric and logical arrays, Age, Weight, and Smoker, grouped by the levels in Sex.

statarray = grpstats(dsa,'Sex')
statarray =
              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528
    Male      Male      47            38.915      180.53         0.44681

statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.

Compute the mean for Age and Weight, grouped by the values in Smoker.

statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})
statarray =
         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66             37.97      149.91
    1    true      34            38.882      161.94

In this case, not all variables in dsa (excluding the grouping variable, Smoker) are numeric or logical arrays; the variable Sex is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using DataVars.

Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight')
statarray =
                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147
    Female_1    Female    true      13            115           146
    Male_0      Male      false     26            158           194
    Male_1      Male      true      21            164           202

There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).

Specify the names for the columns in the output.

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight','VarNames',{'Gender','Smoker',...
'GroupCount','LowestWeight','HighestWeight'})
statarray =
                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147
    Female_1    Female    true      13            115             146
    Male_0      Male      false     26            158             194
    Male_1      Male      true      21            164             202

Load the sample data.

load hospital

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Age, Weight, and Smoker.

dsa = hospital(:,{'Age','Weight','Smoker'});

The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean, minimum, and maximum for the numeric and logical arrays, Age, Weight, and Smoker, with no grouping.

statarray = grpstats(dsa,[],{'mean','min','max'})
statarray =
           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154

           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true

The observation name All indicates that all observations in dsa were used to compute the summary statistics.

Load the sample data.

load carsmall

All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

means = grpstats(Acceleration,Origin)
means = 6×1

14.4377
18.0500
15.8867
16.3778
16.6000
15.5000

means is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

means = grpstats(Acceleration,{Origin,Cylinders})
means = 10×1

17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000

There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.

Return the group names along with the mean acceleration for each group.

[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})
means = 10×1

17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000

grps = 10x2 cell array
{'USA'    }    {'4'}
{'USA'    }    {'6'}
{'USA'    }    {'8'}
{'France' }    {'4'}
{'Japan'  }    {'4'}
{'Japan'  }    {'6'}
{'Germany'}    {'4'}
{'Germany'}    {'6'}
{'Sweden' }    {'4'}
{'Italy'  }    {'4'}

The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.

Load the sample data.

load carsmall

The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum and maximum acceleration grouped by country of origin.

[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})
grpMin = 6×1

8.0000
15.3000
13.9000
12.2000
15.7000
15.5000

grpMax = 6×1

22.2000
21.9000
18.2000
24.6000
17.5000
15.5000

grp = 6x1 cell array
{'USA'    }
{'France' }
{'Japan'  }
{'Germany'}
{'Sweden' }
{'Italy'  }

The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.

Load the sample data.

load carsmall

The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);

Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')

Load the sample data.

load carsmall

The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.

grpstats(Acceleration,Cylinders,0.05)
ans = 3×1

16.6706
16.4765
11.6406

The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.

Plot mean acceleration and weight, grouped by Cylinders, and 95% confidence intervals. Scale the Weight values by 1000 so the means of Weight and Acceleration are the same order of magnitude.

grpstats([Acceleration,Weight/1000],Cylinders,0.05)
ans = 3×2

16.6706    2.3726
16.4765    3.1255
11.6406    3.9703

The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.

grpstats(Acceleration,{Cylinders,Model_Year},0.05)
ans = 8×1

16.1875
16.8667
16.7036
15.5000
17.0000
16.0333
11.0217
13.2222

There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.

Input Arguments

tbl

Input data, specified as a table or dataset array. tbl must include at least one variable that is a grouping variable.

Summary statistics can only be calculated for variables that have a numeric or logical data type. If any variables in tbl (other than the grouping variables) are not numeric or logical arrays, then use the name-value pair argument DataVars to specify the names or column numbers of the numeric and logical variables for which to calculate summary statistics.

groupvar

Identifiers for the grouping variables in the input data, tbl, specified as one of the following:

• Character vector, string array, or cell array of character vectors: names of the grouping variables.

• Positive integer or vector of positive integers: variable numbers of the grouping variables.

• Vector of logical values with number of elements equal to the number of variables in tbl: logical indicator with value true for grouping variables and false otherwise.

• []: no groups (returns summary statistics for all data).

Any variable that is identified by groupvar as a grouping variable must have a valid grouping variable data type: categorical array, logical or numeric vector, datetime or duration vector, string array, or cell array of character vectors.

For example, consider an input table, tbl, with six variables. The fourth variable is named Gender. To be a valid grouping variable, the data type of Gender might be a string array, a cell array of character vectors, or a nominal array, with the unique values Male and Female. To specify the variable Gender as the grouping variable, you can use any of these syntaxes:

• statarray = grpstats(tbl,'Gender')

• statarray = grpstats(tbl,4)

• statarray = grpstats(tbl,logical([0 0 0 1 0 0]))

Data Types: double | logical | char | string | cell
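The three ways of identifying a grouping variable can be compared directly. The sketch below assumes a hypothetical three-variable table in which Gender is the second variable; the data values are invented for illustration.

```matlab
% Hypothetical data for illustration only.
Height = [60; 70; 65; 72];
Gender = {'F'; 'M'; 'F'; 'M'};
Weight = [120; 180; 130; 190];
tbl    = table(Height,Gender,Weight);

byName  = grpstats(tbl,'Gender');              % variable name
byIndex = grpstats(tbl,2);                     % variable number
byMask  = grpstats(tbl,logical([0 1 0]));      % logical indicator

% All three calls are expected to select the same grouping variable.
sameResult = isequal(byName,byIndex) && isequal(byName,byMask);
```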

whichstats

Types of summary statistics to compute, specified as a character vector, a string scalar, a function handle, a string array, or a cell array of character vectors and function handles. Use a cell array to specify multiple types of summary statistics.

Values include:

    'mean'      Mean
    'sem'       Standard error of the mean
    'numel'     Count, or number, of non-NaN elements
    'gname'     Group name
    'std'       Standard deviation
    'var'       Variance
    'min'       Minimum
    'max'       Maximum
    'range'     Range
    'meanci'    95% confidence interval for the mean
    'predci'    95% prediction interval for a new observation

Example: [stat1,stat2] = grpstats(X,group,{'mean','sem'})

You can specify different significance levels for the 'meanci' and 'predci' options using the name-value pair argument, Alpha.

To specify other types of summary statistics, you can use function handles. You can use the handle to any function that accepts a column or matrix of data, and returns the same size output each time grpstats calls it (even if the output for some groups is empty).

If the function accepts a column of data, then the function can return either a scalar value, or an nvals-by-1 column vector for descriptive statistics of length nvals (for example, confidence intervals have length two). If the function accepts a matrix, it must either return a 1-by-ncols row vector, or an nvals-by-ncols matrix, where ncols is the number of columns in the input data matrix.

Example: [stat1,stat2,stat3] = grpstats(X,group,{'mean','std',@skewness})

For functions that do not compute column-wise statistics, specify the computation direction while specifying the function.

Example: stat1 = grpstats(X,group,@(x)sum(x,1))
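To see why the computation direction matters, consider a group that contains a single observation: its data is a row vector, and a function such as sum or median would otherwise operate along that row. The following sketch assumes hypothetical data and fixes the direction to dimension 1 in both function handles.

```matlab
% Hypothetical data for illustration only.
X     = [1 10; 3 30; 5 50];
group = [1; 1; 2];              % group 2 contains a single observation

% Fix the direction so each statistic is always computed per column.
[colSum,colMed] = grpstats(X,group,{@(x) sum(x,1), @(x) median(x,1)});
% colSum = [4 40; 5 50]
% colMed = [2 20; 5 50]
```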

alpha

Significance level, specified as a scalar value in the range (0,1).

• When you specify 'meanci' or 'predci' in whichstats, you can use alpha to specify the significance level for the confidence or prediction intervals. If you specify alpha, then grpstats returns 100×(1 – alpha)% confidence or prediction intervals. If you do not specify alpha, then grpstats returns 95% intervals (alpha = 0.05).

• Use alpha with the syntax to plot group means and corresponding 100×(1 – alpha)% confidence intervals.

Data Types: double

X

Input data, specified as a vector or a matrix. If X is a matrix, then grpstats returns summary statistics for each column of X.

Data Types: double | single

group

Grouping variable, specified as a categorical array, logical or numeric vector, datetime or duration vector, string array, or cell array of character vectors. Each unique value in a grouping variable defines a group. grpstats groups data for summary statistics using the grouping variable values.

There must be a grouping variable value for each row of the input data X. Observations (rows) with the same value of the grouping variable are in the same group. Use [] to compute summary statistics for all data, without using groups.

For example, if Gender is a string array or cell array of character vectors with values 'Male' and 'Female', you can use Gender as a grouping variable to summarize your data by gender.

You can also use more than one grouping variable to group data for summary statistics. In this case, specify a cell array of grouping variables.

For example, if Smoker is a logical vector with values 0 for nonsmokers and 1 for smokers, then specifying the cell array {Gender,Smoker} divides observations into four groups: Male Smoker, Male Nonsmoker, Female Smoker, and Female Nonsmoker. grpstats returns summary statistics only for the combinations of values that exist in the input grouping variables (not all possible combinations).

Data Types: single | double | logical | char | string | cell | categorical | datetime | duration
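The Gender and Smoker case described above can be sketched with a few invented observations. The data below is hypothetical and deliberately omits the Female Smoker combination, so only three of the four possible groups should be returned.

```matlab
% Hypothetical data for illustration only.
Gender = {'Male'; 'Male'; 'Female'; 'Female'; 'Male'};
Smoker = logical([1; 0; 0; 0; 1]);
Weight = [180; 170; 130; 140; 190];

% Group by the observed combinations of Gender and Smoker.
[means,names] = grpstats(Weight,{Gender,Smoker},{'mean','gname'});
ngroups = numel(means);   % 3 observed combinations, not 4
```

Each row of names holds the Gender and Smoker values of the group whose mean is in the same row of means.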

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'DataVars',[1,3,4],'Alpha',0.01 specifies that summary statistics be calculated for the 1st, 3rd, and 4th variables in a dataset array, with 99% confidence intervals.

Alpha

Significance level for confidence and prediction intervals, specified as the comma-separated pair consisting of 'Alpha' and a scalar value in the range (0,1).

When you include 'meanci' or 'predci' in whichstats, you can use Alpha to specify the significance level for confidence or prediction intervals. If you specify the value α, then grpstats returns 100×(1 – α)% confidence or prediction intervals.

If you do not specify a value for Alpha, then grpstats returns 95% intervals (α = 0.05).

Example: 'Alpha',0.1

Data Types: double
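The effect of Alpha can be checked by comparing interval widths. This is a minimal sketch on hypothetical data: a larger Alpha gives a lower confidence level and therefore narrower intervals around the same group means.

```matlab
% Hypothetical data for illustration only.
X     = [1; 2; 3; 4; 5; 6];
group = [1; 1; 1; 2; 2; 2];

% 90% confidence intervals for the group means (Alpha = 0.1).
[m,ci90] = grpstats(X,group,{'mean','meanci'},'Alpha',0.1);

% Default 95% confidence intervals (Alpha = 0.05).
[~,ci95] = grpstats(X,group,{'mean','meanci'});

% Interval widths: the second column holds the upper limits.
width90 = ci90(:,2) - ci90(:,1);
width95 = ci95(:,2) - ci95(:,1);
```

Here width90 is expected to be smaller than width95 for every group, while m is the same in both calls.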

DataVars

Variable names or columns indicating which variables in the input data tbl you want to compute summary statistics for, specified as the comma-separated pair consisting of 'DataVars' and a string array, cell array of character vectors, vector of positive integers, or logical vector. Use a character vector or string scalar to specify a variable name, a positive integer to specify a variable column number, or logical values to indicate which variables to include (true if you want to compute summary statistics, false otherwise).

You must specify DataVars if there are any variables in tbl (other than the grouping variables specified in groupvar) that are not numeric or logical arrays. Summary statistics can only be calculated for variables that have a numeric or logical data type.

Example: 'DataVars',{'Height','Weight'}

Data Types: double | string | cell | char

VarNames

Variable names for the output statarray, specified as the comma-separated pair consisting of 'VarNames' and a string array or cell array of character vectors. By default, grpstats constructs output variable names by appending a prefix to the variable names from the input data tbl. This prefix corresponds to the summary statistic name.

Example: 'VarNames',{'Gender','GroupCount','MaleMean','FemaleMean'}

Data Types: string | cell

Output Arguments

statarray

Group summary statistics, returned as a table or a dataset array. If tbl is a table, grpstats returns statarray as a table. If tbl is a dataset array, grpstats returns statarray as a dataset array.

statarray contains summary statistic values for the groups of data in tbl determined by the levels of the grouping variables specified by groupvar. There is a row in statarray for each observed value or combination of values in the variables specified by groupvar. The output statarray contains:

• All grouping variables specified by groupvar.

• The variable GroupCount, containing the number of observations in each group.

• Group summary statistic values for all variables in tbl (other than those specified by groupvar), or for only the variables specified using DataVars.

The total number of variables in statarray is ngroupvars + 1 + ndatavars×nstats, where ngroupvars is the number of variables in groupvar, ndatavars is the number of variables for which summary statistics are computed, and nstats is the number of summary statistic types specified in whichstats.
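As a worked check of this count, assume a hypothetical table with two grouping variables (G1, G2) and two data variables (A, B), summarized with two statistic types. The formula gives 2 + 1 + 2×2 = 7 variables.

```matlab
% Hypothetical data for illustration only.
G1  = [1; 1; 2; 2];
G2  = {'x'; 'y'; 'x'; 'y'};
A   = [1; 2; 3; 4];
B   = [10; 20; 30; 40];
tbl = table(G1,G2,A,B);

statarray = grpstats(tbl,{'G1','G2'},{'min','max'});

% ngroupvars + 1 + ndatavars*nstats = 2 + 1 + 2*2
nvars = width(statarray);
```

The seven variables should be G1, G2, GroupCount, and one min_ and one max_ variable for each of A and B.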

grpstats assigns default names to the variables in statarray, unless you specify variable names using the name-value pair argument VarNames.

means

Group means for the groups of data in the vector or matrix X determined by the levels of group, returned as an ngroups-by-ncols array. Here, ngroups is the number of unique values in the grouping variable, and ncols is the number of columns in X. If X is a vector, then means is a column vector.

stats1,...,statsN

Group summary statistics for the groups of data in the vector or matrix X determined by the levels of group, returned as ngroups-by-ncols arrays. Here, ngroups is the number of unique values in the grouping variable, and ncols is the number of columns in X. You must specify an output argument for each type of summary statistic specified in whichstats.

If a summary statistic type in whichstats returns a value of length nvals (for example, a confidence interval is a descriptive statistic of length two), then the corresponding output argument is an ngroups-by-ncols-by-nvals array.

Algorithms

• grpstats treats NaNs as missing values, and removes them from the input data before calculating summary statistics.

• grpstats ignores empty group names.
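Both behaviors can be illustrated on small hypothetical inputs. In the sketch below, the NaN in X is expected to be dropped before the mean and count are computed, and the observation whose group name is empty is expected to be left out of every group.

```matlab
% Hypothetical data for illustration only.
X     = [1; NaN; 3; 4];
group = [1; 1; 1; 2];

% The NaN is treated as missing: group 1 uses only the values 1 and 3.
[m,n] = grpstats(X,group,{'mean','numel'});
% m = [2; 4], n = [2; 1]

Y     = [1; 100; 3; 4];
names = {'a'; ''; 'a'; 'b'};

% The observation with the empty group name (value 100) is ignored.
mNamed = grpstats(Y,names);
% mNamed = [2; 4]
```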