# grpstats

Summary statistics organized by group

## Syntax

``statarray = grpstats(tbl,groupvar)``
``statarray = grpstats(tbl,groupvar,whichstats)``
``statarray = grpstats(tbl,groupvar,whichstats,Name,Value)``
``means = grpstats(X,group)``
``[stats1,...,statsN] = grpstats(X,group,whichstats)``
``[stats1,...,statsN] = grpstats(X,group,whichstats,'Alpha',alpha)``
``grpstats(X,group,alpha)``

## Description

`statarray = grpstats(tbl,groupvar)` returns a table or dataset array with the means for the groups of data in `tbl`, where the groups are determined by the values of the grouping variable or variables specified in `groupvar`.

If there is a single grouping variable, then there is a row in `statarray` for each value of the grouping variable. `grpstats` sorts the groups by order of appearance (if the grouping variable is a character vector or string scalar), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

If `groupvar` is a string array or cell array of character vectors containing multiple grouping variable names, or a vector of column numbers, then there is a row in `statarray` for each observed unique combination of values of the grouping variables. `grpstats` sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

If any variables in `tbl` (other than those specified in `groupvar`) are not numeric or logical arrays, then you must use the name-value pair argument `DataVars` to specify the names or column numbers of the numeric and logical variables for which you want to calculate means.

`statarray = grpstats(tbl,groupvar,whichstats)` returns the group values for the summary statistic types specified in `whichstats`.

`statarray = grpstats(tbl,groupvar,whichstats,Name,Value)` uses additional options specified by one or more `Name,Value` pair arguments.

`means = grpstats(X,group)` returns a column vector or matrix with the means of the groups of the data in the matrix or vector `X`, where the groups are determined by the values of the grouping variable or variables, `group`. The rows of `means` correspond to the grouping variable values.

If there is a single grouping variable, then there is a row in `means` for each value of the grouping variable. `grpstats` sorts the groups by order of appearance (if the grouping variable is a character vector or string scalar), in ascending numeric order (if the grouping variable is numeric), or in order of the levels (if the grouping variable is categorical).

If `group` is a string array or cell array of grouping variables, then there is a row in `means` for each observed unique combination of values of the grouping variables. `grpstats` sorts the groups by the values of the first grouping variable, then the second grouping variable, and so on.

If `X` is a matrix, then `means` is a matrix with the same number of columns as `X`. Each column of `means` has the group means for the corresponding column of `X`.

`[stats1,...,statsN] = grpstats(X,group,whichstats)` returns column vectors or arrays with group values for the summary statistic types specified in `whichstats`.

`[stats1,...,statsN] = grpstats(X,group,whichstats,'Alpha',alpha)` specifies the significance level for confidence and prediction intervals.

`grpstats(X,group,alpha)` plots the means of the groups of data in the vector or matrix `X`, where the groups are determined by the values of the grouping variable, `group`. The grouping variable values are on the horizontal plot axis. Each group mean has 100×(1 – `alpha`)% confidence intervals. If `X` is a matrix, then `grpstats` plots the means and confidence intervals for each column of `X`.

If `group` is a cell array of grouping variables, then `grpstats` plots the means and confidence intervals for the groups of data in `X` determined by the unique combinations of values of the grouping variables. For example, if there are two grouping variables, each with two values, there are four possible combinations of grouping variable values. The plot includes only the combinations of values that exist in the input grouping variables (not all possible combinations).

## Examples


`load('hospital')`

The dataset array `hospital` has 100 observations and 7 variables.

Create a dataset array with only the variables `Sex`, `Age`, `Weight`, and `Smoker`.

`dsa = hospital(:,{'Sex','Age','Weight','Smoker'});`

`Sex` is a nominal array, with levels `Male` and `Female`. The variables `Age` and `Weight` have numeric values, and `Smoker` has logical values.

Compute the mean for the numeric and logical arrays, `Age`, `Weight`, and `Smoker`, grouped by the levels in `Sex`.

`statarray = grpstats(dsa,'Sex')`
```
statarray = 
              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528    
    Male      Male      47            38.915      180.53         0.44681    
```

`statarray` is a dataset array with two rows, corresponding to the levels in `Sex`. `GroupCount` is the number of observations in each group. The means of `Age`, `Weight`, and `Smoker`, grouped by `Sex`, are given in `mean_Age`, `mean_Weight`, and `mean_Smoker`.

Compute the mean for `Age` and `Weight`, grouped by the values in `Smoker`.

`statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})`
```
statarray = 
         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66            37.97       149.91     
    1    true      34            38.882      161.94     
```

In this case, not all variables in `dsa` (excluding the grouping variable, `Smoker`) are numeric or logical arrays; the variable `Sex` is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using `DataVars`.

Compute the minimum and maximum weight, grouped by the combinations of values in `Sex` and `Smoker`.

```
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
                     'DataVars','Weight')
```
```
statarray = 
                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147       
    Female_1    Female    true      13            115           146       
    Male_0      Male      false     26            158           194       
    Male_1      Male      true      21            164           202       
```

There are two unique values in `Smoker` and two levels in `Sex`, for a total of four possible combinations of values: Female Nonsmoker (`Female_0`), Female Smoker (`Female_1`), Male Nonsmoker (`Male_0`), and Male Smoker (`Male_1`).

Specify the names for the columns in the output.

```
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
                     'DataVars','Weight','VarNames',{'Gender','Smoker',...
                     'GroupCount','LowestWeight','HighestWeight'})
```
```
statarray = 
                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147          
    Female_1    Female    true      13            115             146          
    Male_0      Male      false     26            158             194          
    Male_1      Male      true      21            164             202          
```

`load('hospital')`

The dataset array `hospital` has 100 observations and 7 variables.

Create a dataset array with only the variables `Age`, `Weight`, and `Smoker`.

`dsa = hospital(:,{'Age','Weight','Smoker'});`

The variables `Age` and `Weight` have numeric values, and `Smoker` has logical values.

Compute the mean, minimum, and maximum for the numeric and logical arrays, `Age`, `Weight`, and `Smoker`, with no grouping.

`statarray = grpstats(dsa,[],{'mean','min','max'})`
```
statarray = 
           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154        

           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true      
```

The observation name `All` indicates that all observations in `dsa` were used to compute the summary statistics.

`load('carsmall')`

All variables are measured for 100 cars. `Origin` is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). `Cylinders` has three unique values, `4`, `6`, and `8`, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

`means = grpstats(Acceleration,Origin)`
```
means = 6×1

   14.4377
   18.0500
   15.8867
   16.3778
   16.6000
   15.5000
```

`means` is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

`means = grpstats(Acceleration,{Origin,Cylinders})`
```
means = 10×1

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000
```

There are 18 possible combinations of grouping variable values because `Origin` has 6 unique values and `Cylinders` has 3 unique values. Only 10 of the possible combinations appear in the data, so `means` is a 10-by-1 vector of group means corresponding to the observed combinations of values.

Return the group names along with the mean acceleration for each group.

`[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})`
```
means = 10×1

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000
```
```
grps = 10x2 cell
    {'USA'    }    {'4'}
    {'USA'    }    {'6'}
    {'USA'    }    {'8'}
    {'France' }    {'4'}
    {'Japan'  }    {'4'}
    {'Japan'  }    {'6'}
    {'Germany'}    {'4'}
    {'Germany'}    {'6'}
    {'Sweden' }    {'4'}
    {'Italy'  }    {'4'}
```

The output `grps` shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.

`load carsmall`

The variable `Acceleration` was measured for 100 cars. The variable `Origin` is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum and maximum acceleration grouped by country of origin.

`[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})`
```
grpMin = 6×1

    8.0000
   15.3000
   13.9000
   12.2000
   15.7000
   15.5000
```
```
grpMax = 6×1

   22.2000
   21.9000
   18.2000
   24.6000
   17.5000
   15.5000
```
```
grp = 6x1 cell
    {'USA'    }
    {'France' }
    {'Japan'  }
    {'Germany'}
    {'Sweden' }
    {'Italy'  }
```

The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.

`load('carsmall')`

The variable `Weight` was measured for 100 cars. The variable `Model_Year` has three unique values, `70`, `76`, and `82`, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

```
[means,pred,grp] = grpstats(Weight,Model_Year,...
                            {'mean','predci','gname'},'Alpha',0.1);
```

Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

```
ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')
```

`load('carsmall')`

The variables `Acceleration` and `Weight` are the acceleration and weight values measured for 100 cars. The variable `Cylinders` is the number of cylinders in each car. The variable `Model_Year` has three unique values, `70`, `76`, and `82`, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by `Cylinders`, with 95% confidence intervals.

`grpstats(Acceleration,Cylinders,0.05)`
```
ans = 3×1

   16.6706
   16.4765
   11.6406
```

The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.

Plot mean acceleration and weight, grouped by `Cylinders`, and 95% confidence intervals. Scale the `Weight` values by 1000 so the means of `Weight` and `Acceleration` are the same order of magnitude.

`grpstats([Acceleration,Weight/1000],Cylinders,0.05)`
```
ans = 3×2

   16.6706    2.3726
   16.4765    3.1255
   11.6406    3.9703
```

The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both `Cylinders` and `Model_Year`. Specify 95% confidence intervals.

`grpstats(Acceleration,{Cylinders,Model_Year},0.05)`
```
ans = 8×1

   16.1875
   16.8667
   16.7036
   15.5000
   17.0000
   16.0333
   11.0217
   13.2222
```

There are nine possible combinations of grouping variable values because there are three unique values in `Cylinders` and three unique values in `Model_Year`. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.

## Input Arguments

`tbl`: Input data, specified as a table or dataset array. `tbl` must include at least one variable that is a grouping variable.

Summary statistics can only be calculated for variables that have a numeric or logical data type. If any variables in `tbl` (other than the grouping variables) are not numeric or logical arrays, then use the name-value pair argument `DataVars` to specify the names or column numbers of the numeric and logical variables for which to calculate summary statistics.

`groupvar`: Identifiers for the grouping variables in the input data, `tbl`, specified as one of the following:

• Character vector, string array, or cell array of character vectors: names of the grouping variables.

• Positive integer or vector of positive integers: variable numbers of the grouping variables.

• Vector of logical values with number of elements equal to the number of variables in `tbl`: logical indicator with value `true` for grouping variables and `false` otherwise.

• `[]`: no groups (returns summary statistics for all data).

Any variable that is identified by `groupvar` as a grouping variable must have a valid grouping variable data type: categorical array, logical or numeric vector, datetime or duration vector, string array, or cell array of character vectors.

For example, consider an input table, `tbl`, with six variables. The fourth variable is named `Gender`. To be a valid grouping variable, the data type of `Gender` might be a string array, a cell array of character vectors, or a nominal array, with the unique values `Male` and `Female`. To specify the variable `Gender` as the grouping variable, you can use any of these syntaxes:

• `statarray = grpstats(tbl,'Gender')`

• `statarray = grpstats(tbl,4)`

• ```statarray = grpstats(tbl,logical([0 0 0 1 0 0]))```
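The three forms above can be checked against each other on a small table. This is a minimal sketch, assuming a made-up six-variable table whose fourth variable is `Gender`; the variable names and values are illustrative only and do not come from any shipped data set.

```matlab
% Hypothetical six-variable table; Gender is the fourth variable.
ID       = (1:4)';
Age      = [30; 40; 50; 60];
Height   = [70; 64; 68; 62];
Gender   = {'Male'; 'Female'; 'Male'; 'Female'};
Weight   = [180; 130; 170; 140];
Systolic = [120; 115; 130; 125];
tbl = table(ID, Age, Height, Gender, Weight, Systolic);

% Identify the grouping variable by name, by column number,
% and by logical indicator.
byName  = grpstats(tbl, 'Gender');
byIndex = grpstats(tbl, 4);
byMask  = grpstats(tbl, logical([0 0 0 1 0 0]));

% All three calls should describe the same groups and statistics.
same = isequal(byName, byIndex, byMask);
```

Because every variable other than `Gender` is numeric, no `DataVars` argument is needed in this sketch.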

Data Types: `double` | `logical` | `char` | `string` | `cell`

`whichstats`: Type of summary statistics to compute, specified as one of the following values.

• Character vector or string scalar specifying the type of summary statistics, as described in this table.

| Type | Description |
| --- | --- |
| `'mean'` | Mean |
| `'sem'` | Standard error of the mean |
| `'numel'` | Count, or number, of non-`NaN` elements |
| `'gname'` | Group name |
| `'std'` | Standard deviation |
| `'var'` | Variance |
| `'min'` | Minimum |
| `'max'` | Maximum |
| `'range'` | Range |
| `'meanci'` | 95% confidence interval for the mean. You can specify different significance levels using the `Alpha` name-value pair argument. |
| `'predci'` | 95% prediction interval for a new observation. You can specify different significance levels using the `Alpha` name-value pair argument. |

• Function handle to specify any other type of summary statistics. You can use the handle to any function that accepts a column or matrix of data, and returns the same size output each time `grpstats` calls the function handle (even if the output for some groups is empty).

• If the function accepts a column of data, then the function can return either a scalar value or an nvals-by-1 column vector for descriptive statistics of length nvals (for example, a confidence interval has length two). If the function accepts a matrix, the function must return either a 1-by-ncols row vector or an nvals-by-ncols matrix, where ncols is the number of columns in the input data matrix.

• For functions that do not compute column-wise statistics, specify the computation direction while specifying the function. For example, to use the `sum` function, specify the function handle as `@(x)sum(x,1)` because `sum` computes column-wise statistics for matrices with two or more rows, but not for single-row matrices.

• String array or a cell array of character vectors or function handles to specify multiple types of summary statistics.

Example: `stat1 = grpstats(X,group,'sem')`

Example: ```stat1 = grpstats(X,group,@(x)sum(x,1))```

Example: ```[stat1,stat2,stat3] = grpstats(X,group,{'mean','std',@skewness})```
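As an illustration of the function-handle rules above, the following sketch passes two handles that each state the computation direction explicitly. The data values are made up for the example.

```matlab
% Made-up data: three observations in group 1, two in group 2.
X     = [1; 2; 3; 10; 20];
group = [1; 1; 1; 2; 2];

% Each handle operates along dimension 1, so it returns one value per
% column even when a group contains a single row.
[med, tot] = grpstats(X, group, {@(x) median(x,1), @(x) sum(x,1)});
% med holds the group medians and tot the group sums, one row per group,
% with numeric groups in ascending order.
```

Requesting two statistics requires two output arguments, one per entry in `whichstats`.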

`alpha`: Significance level, specified as a scalar value in the range (0,1).

• When you specify `'meanci'` or `'predci'` in `whichstats`, you can use `alpha` to specify the significance level for the confidence or prediction intervals. If you specify `alpha`, then `grpstats` returns 100×(1 – `alpha`)% confidence or prediction intervals. If you do not specify `alpha`, then `grpstats` returns 95% intervals (```alpha = 0.05```).

• Use `alpha` with the syntax `grpstats(X,group,alpha)` to plot group means and corresponding 100×(1 – `alpha`)% confidence intervals.

Data Types: `double`

`X`: Input data, specified as a vector or a matrix. If `X` is a matrix, then `grpstats` returns summary statistics for each column of `X`.

Data Types: `double` | `single`

`group`: Grouping variable, specified as a categorical array, logical or numeric vector, datetime or duration vector, string array, or cell array of character vectors. Each unique value in a grouping variable defines a group. `grpstats` groups data for summary statistics using the grouping variable values.

There must be a grouping variable value for each row of the input data `X`. Observations (rows) with the same value of the grouping variable are in the same group. Use `[]` to compute summary statistics for all data, without using groups.

For example, if `Gender` is a string array or cell array of character vectors with values `'Male'` and `'Female'`, you can use `Gender` as a grouping variable to summarize your data by gender.

You can also use more than one grouping variable to group data for summary statistics. In this case, specify a cell array of grouping variables.

For example, if `Smoker` is a logical vector with values `0` for nonsmokers and `1` for smokers, then specifying the cell array `{Gender,Smoker}` divides observations into four groups: Male Smoker, Male Nonsmoker, Female Smoker, and Female Nonsmoker. `grpstats` returns summary statistics only for the combinations of values that exist in the input grouping variables (not all possible combinations).

Data Types: `single` | `double` | `logical` | `char` | `string` | `cell` | `categorical` | `datetime` | `duration`
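The behavior with two grouping variables can be sketched on a tiny made-up sample in which one of the four possible combinations (Female Smoker) never occurs. The variable names and values here are illustrative assumptions.

```matlab
% Made-up sample: no female smokers are present.
Weight = [60; 70; 80; 90; 100];
Gender = {'Male'; 'Male'; 'Female'; 'Female'; 'Female'};
Smoker = [true; false; false; false; false];

% Group by both variables and also return the group names.
[means, names] = grpstats(Weight, {Gender, Smoker}, {'mean','gname'});

% Only the observed combinations produce a row.
nGroups = size(means, 1);
```

Here `names` has one row per observed combination and one column per grouping variable, so the missing combination does not appear as an empty or `NaN` row.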

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'DataVars',[1,3,4],'Alpha',0.01` specifies that summary statistics be calculated for the 1st, 3rd, and 4th variables in a dataset array, with 99% confidence intervals.

`Alpha`: Significance level for confidence and prediction intervals, specified as the comma-separated pair consisting of `'Alpha'` and a scalar value in the range (0,1).

When you include `'meanci'` or `'predci'` in `whichstats`, you can use `Alpha` to specify the significance level for confidence or prediction intervals. If you specify the value α, then `grpstats` returns 100×(1 – α)% confidence or prediction intervals.

If you do not specify a value for `Alpha`, then `grpstats` returns 95% intervals (α = 0.05).

Example: `'Alpha',0.1`

Data Types: `double`
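A short sketch of the effect of `Alpha`, using made-up data: the 90% confidence intervals (`'Alpha',0.1`) should be narrower than the default 95% intervals. The `reshape` call is a defensive assumption so that the bounds can be read as two columns regardless of how the interval array is laid out for vector input.

```matlab
% Made-up data: two groups of four observations each.
X     = [10; 12; 14; 16; 20; 22; 24; 26];
group = [1; 1; 1; 1; 2; 2; 2; 2];

% Default 95% intervals, then 90% intervals with 'Alpha',0.1.
[m, ci95] = grpstats(X, group, {'mean','meanci'});
[~, ci90] = grpstats(X, group, {'mean','meanci'}, 'Alpha', 0.1);

% Arrange the bounds as [lower upper], one row per group.
ci95 = reshape(ci95, [], 2);
ci90 = reshape(ci90, [], 2);

width95 = ci95(:,2) - ci95(:,1);
width90 = ci90(:,2) - ci90(:,1);
```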

`DataVars`: Variable names or columns indicating which variables in the input data `tbl` you want to compute summary statistics for, specified as the comma-separated pair consisting of `'DataVars'` and a string array, cell array of character vectors, vector of positive integers, or logical vector. Use a character vector or string scalar to specify a variable name, a positive integer to specify a variable column number, or logical values to indicate which variables to include (`true` if you want to compute summary statistics, `false` otherwise).

You must specify `DataVars` if there are any variables in `tbl` (other than the grouping variables specified in `groupvar`) that are not numeric or logical arrays. Summary statistics can only be calculated for variables that have a numeric or logical data type.

Example: `'DataVars',{'Height','Weight'}`

Data Types: `double` | `string` | `cell` | `char`
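The following sketch shows `DataVars` on a table rather than a dataset array. The table is made up for illustration; its `Name` variable is a cell array of character vectors, so the numeric variables to summarize are listed explicitly.

```matlab
% Made-up table with one non-numeric, non-grouping variable (Name).
Sex    = categorical({'Female'; 'Male'; 'Female'; 'Male'});
Name   = {'Ana'; 'Ben'; 'Cy'; 'Dee'};
Age    = [30; 40; 50; 60];
Weight = [130; 180; 140; 170];
tbl = table(Sex, Name, Age, Weight);

% Summarize only the numeric variables, grouped by Sex.
statarray = grpstats(tbl, 'Sex', 'mean', 'DataVars', {'Age','Weight'});
outNames = statarray.Properties.VariableNames;
```

The output contains the grouping variable, `GroupCount`, and one `mean_` variable for each entry in `DataVars`; `Name` is left out.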

`VarNames`: Variable names for the output `statarray`, specified as the comma-separated pair consisting of `'VarNames'` and a string array or cell array of character vectors. By default, `grpstats` constructs output variable names by adding a prefix to the variable names from the input data `tbl`. This prefix corresponds to the summary statistic name.

Example: `'VarNames',{'Gender','GroupCount','MaleMean','FemaleMean'}`

Data Types: `string` | `cell`

## Output Arguments

`statarray`: Group summary statistics, returned as a table or a dataset array. If `tbl` is a table, `grpstats` returns `statarray` as a table. If `tbl` is a dataset array, `grpstats` returns `statarray` as a dataset array.

`statarray` contains summary statistic values for the groups of data in `tbl` determined by the levels of the grouping variables specified by `groupvar`. There is a row in `statarray` for each observed value or combination of values in the variables specified by `groupvar`. The output `statarray` contains:

• All grouping variables specified by `groupvar`.

• The variable `GroupCount`, containing the number of observations in each group.

• Group summary statistic values for all variables in `tbl` (other than those specified by `groupvar`), or for only the variables specified using `DataVars`.

The total number of variables in `statarray` is ngroupvars + 1 + ndatavars×nstats, where ngroupvars is the number of variables in `groupvar`, ndatavars is the number of variables for which summary statistics are computed, and nstats is the number of summary statistic types specified in `whichstats`.

`grpstats` assigns default names to the variables in `statarray`, unless you specify variable names using the name-value pair argument `VarNames`.
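The variable count ngroupvars + 1 + ndatavars×nstats can be checked with a small made-up table that has one grouping variable, two data variables, and two requested statistics, which gives 1 + 1 + 2×2 = 6 variables.

```matlab
% Made-up table: one grouping variable (Group) and two data variables.
Group = [1; 1; 2; 2; 3; 3];
A     = [1; 3; 5; 7; 9; 11];
B     = [2; 4; 6; 8; 10; 12];
tbl = table(Group, A, B);

% Request two statistics for each data variable.
statarray = grpstats(tbl, 'Group', {'mean','std'});

% ngroupvars + 1 + ndatavars*nstats
expectedWidth = 1 + 1 + 2*2;
actualWidth   = width(statarray);
```

With the default naming, the statistics appear as variables such as `mean_A` and `std_A`, each carrying the statistic name as a prefix.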

`means`: Group means for the groups of data in the vector or matrix `X` determined by the levels of `group`, returned as an ngroups-by-ncols array. Here, ngroups is the number of unique values in the grouping variable, and ncols is the number of columns in `X`. If `X` is a vector, then `means` is a column vector.

`stats1,...,statsN`: Group summary statistics for the groups of data in the vector or matrix `X` determined by the levels of `group`, returned as ngroups-by-ncols arrays. Here, ngroups is the number of unique values in the grouping variable, and ncols is the number of columns in `X`. You must specify an output argument for each type of summary statistic specified in `whichstats`.

If a summary statistic type in `whichstats` returns a value of length nvals (for example, a confidence interval is a descriptive statistic of length two), then the corresponding output argument is an ngroups-by-ncols-by-nvals array.

## Algorithms

• `grpstats` treats `NaN`s as missing values, and removes them from the input data before calculating summary statistics.

• `grpstats` ignores empty group names.
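The `NaN` handling described above can be seen with a made-up vector that contains one missing value. The `'numel'` statistic reports how many non-`NaN` elements each group contributed.

```matlab
% Made-up data: group 1 contains a NaN, group 2 does not.
X     = [1; NaN; 3; 4];
group = [1; 1; 1; 2];

% The NaN is dropped before the statistics are computed.
[means, counts] = grpstats(X, group, {'mean','numel'});
% Group 1 is averaged over the two remaining values, 1 and 3.
```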

## Alternative Functionality

MATLAB® includes the function `groupsummary`, which also returns group summaries and is recommended when you are working with a table.
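As a rough comparison, the sketch below computes group means for the same made-up table with both functions. It assumes a MATLAB release that includes `groupsummary`; the table and its variable names are illustrative only.

```matlab
% Made-up table with one grouping variable and one data variable.
Group = [1; 1; 1; 2; 2];
Value = [1; 2; 3; 10; 20];
tbl = table(Group, Value);

% Group means with grpstats and with groupsummary.
viaGrpstats     = grpstats(tbl, 'Group', 'mean');
viaGroupsummary = groupsummary(tbl, 'Group', 'mean', 'Value');

% Both outputs carry the group means in a variable named mean_Value.
sameMeans = isequal(viaGrpstats.mean_Value, viaGroupsummary.mean_Value);
```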

## Extended Capabilities

Introduced before R2006a