Plot observation diagnostics of linear regression model
plotDiagnostics creates a plot of observation diagnostics such as leverage, Cook's distance, and delete-1 statistics to identify outliers and influential observations.
plotDiagnostics(mdl) creates a leverage plot of the observations in the linear regression model mdl. A dotted line in the plot represents the recommended threshold value.
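plotDiagnostics(mdl,plottype) specifies the type of observation diagnostics to plot in plottype.
plotDiagnostics(mdl,plottype,Name,Value) specifies the graphical properties of diagnostic data points using one or more name-value pair arguments. For example, you can specify the marker symbol and size for the data points.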
h = plotDiagnostics(___) returns graphics objects for the lines or contour in the plot using any of the input argument combinations in the previous syntaxes. Use h to modify the properties of a specific line or contour after you create the plot. For a list of properties, see Line Properties and Contour Properties.
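As a rough illustration of the last syntax, the following sketch captures the returned graphics objects and restyles the diagnostic data points after the plot exists. The model (MPG as a function of Weight and Horsepower from carsmall) and the property values are illustrative assumptions, not part of the function description.

```matlab
% Illustrative model; the choice of predictors is an assumption.
load carsmall
mdl = fitlm([Weight Horsepower],MPG);

% Capture the graphics objects of the default leverage plot.
h = plotDiagnostics(mdl);

% Restyle the diagnostic data points (the first graphics object)
% by using dot notation.
h(1).Marker = 'o';
h(1).MarkerSize = 8;
h(1).Color = [0.4 0.6 0.7];
```

Modifying h after plotting is convenient when you want to try several styles without recomputing the diagnostics.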
Find Outliers Using Leverage and Cook's Distance
Plot the leverage values and Cook's distances of observations and find the outliers.
Load the carsmall data set and fit a linear regression model of the mileage as a function of model year, weight, and weight squared.
load carsmall
tbl = table(MPG,Weight);
tbl.Year = categorical(Model_Year);
mdl = fitlm(tbl,'MPG ~ Year + Weight^2');
Plot the leverage values.
plotDiagnostics(mdl)
legend('show') % Show the legend
The dotted line represents the recommended threshold value 2*p/n, where p is the number of coefficients, and n is the number of observations. Find the threshold value using the NumCoefficients and NumObservations properties.
t_leverage = 2*mdl.NumCoefficients/mdl.NumObservations
t_leverage = 0.1064
Find the observations with leverage values that exceed the threshold value.
find(mdl.Diagnostics.Leverage > t_leverage)
ans = 3×1

    26
    32
    35
You can also find an observation number by using a data tip. Select the data points above the threshold line to display their data tips. The data tip includes the x-axis and y-axis values for the selected point, along with the observation number.
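Plot the Cook's distance values.
plotDiagnostics(mdl,'cookd')
legend('show') % Show the legend
The dotted line represents the recommended threshold value, which is three times the mean Cook's distance. Compute the threshold value.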
t_cookd = 3*mean(mdl.Diagnostics.CooksDistance,'omitnan')
t_cookd = 0.0320
Find the observations with the Cook's distance values that exceed the threshold value.
find(mdl.Diagnostics.CooksDistance > t_cookd)
ans = 6×1

    26
    35
    80
    90
    92
    97
Two observations (26 and 35) are outliers by both measures, but some points (32, 80, 90, 92, and 97) are outliers by only one measure.
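The comparison of the two measures can also be done programmatically. This is a minimal sketch that reuses the model and thresholds from this example; the variable names flaggedByBoth and flaggedByOne are introduced here for illustration only.

```matlab
load carsmall
tbl = table(MPG,Weight);
tbl.Year = categorical(Model_Year);
mdl = fitlm(tbl,'MPG ~ Year + Weight^2');

% Recommended thresholds, as computed earlier in this example.
t_leverage = 2*mdl.NumCoefficients/mdl.NumObservations;
t_cookd = 3*mean(mdl.Diagnostics.CooksDistance,'omitnan');

% Observations flagged by each measure.
idxLeverage = find(mdl.Diagnostics.Leverage > t_leverage);
idxCookd = find(mdl.Diagnostics.CooksDistance > t_cookd);

% Observations flagged by both measures, and by exactly one measure.
flaggedByBoth = intersect(idxLeverage,idxCookd)
flaggedByOne = setxor(idxLeverage,idxCookd)
```

Given the results reported in this example, flaggedByBoth should contain observations 26 and 35, and flaggedByOne should contain 32, 80, 90, 92, and 97.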
plottype — Type of plot
'leverage' (default) | 'contour' | 'cookd' | 'covratio' | 'dfbetas' | 'dffits' | 's2_i'
Type of plot, specified as one of the values in this table.
|Value||Plot Type||Dotted Reference Line in Plot||Purpose|
|'contour'||Residual vs. leverage with overlaid contours of Cook's distance||Contours of Cook's distance||Identify observations with large residual values, high leverage, and large Cook's distance values.|
|'cookd'||Cook's distance||Recommended threshold, computed by 3*mean(mdl.Diagnostics.CooksDistance,'omitnan')||Identify observations with large Cook's distance values.|
|'covratio'||Delete-1 ratio of determinant of covariance||Recommended thresholds, computed by 1 ± 3*p/n, where p is the number of coefficients and n is the number of observations||Identify observations where the delete-1 statistic value is not in the range of the recommended thresholds.|
|'dfbetas'||Delete-1 scaled differences in coefficient estimates||Recommended threshold, computed by 3/sqrt(n), where n is the number of observations||Identify observations with large delete-1 statistic values.|
|'dffits'||Delete-1 scaled differences in fitted values||Recommended threshold, computed by 2*sqrt(p/n)||Identify observations with delete-1 statistic values that are large in absolute value.|
|'leverage'||Leverage||Recommended threshold, computed by 2*p/n||Identify high leverage observations.|
|'s2_i'||Delete-1 variance||Mean squared error (mdl.MSE)||Compare the delete-1 variance with the mean squared error.|
For all plot types except 'contour', the x-axis is the row number (case order) of observations.
The Diagnostics property of mdl contains the diagnostic values used by plotDiagnostics to create plots.
For more information about observation diagnostics, see Cook’s Distance, Delete-1 Statistics, and Leverage.
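Because the plotted values come from the Diagnostics property, the same values are available for numeric follow-up. The sketch below plots the delete-1 variance and lists the three observations whose removal lowers the error variance the most. It assumes that the plot type value is 's2_i', that the corresponding table variable is named S2_i, and that mdl.ObservationInfo.Subset marks the rows used in the fit; check mdl.Diagnostics.Properties.VariableNames if your release differs.

```matlab
load carsmall
tbl = table(MPG,Weight);
tbl.Year = categorical(Model_Year);
mdl = fitlm(tbl,'MPG ~ Year + Weight^2');

% Plot the delete-1 variance; the reference line is the mean squared error.
plotDiagnostics(mdl,'s2_i')

% Restrict attention to the rows that were used in the fit.
used = find(mdl.ObservationInfo.Subset);

% Sort the delete-1 variances in ascending order and keep the three smallest.
[~,order] = sort(mdl.Diagnostics.S2_i(used));
mostInfluential = used(order(1:3))

% Ratio of each delete-1 variance to the mean squared error of the full fit.
varianceRatio = mdl.Diagnostics.S2_i(mostInfluential)/mdl.MSE
```

A ratio well below 1 indicates that the fit without that observation has noticeably smaller error variance.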
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
The graphical properties listed here are only a subset. For a complete list, see Line Properties. The specified properties determine the appearance of diagnostic data points.
Color — Line color
RGB triplet | hexadecimal color code | color name | short name
Line color, specified as the comma-separated pair consisting of 'Color' and an RGB triplet, hexadecimal color code, color name, or short name for one of the color options listed in the following table.
The 'Color' name-value pair argument also determines marker outline color and marker fill color if 'MarkerEdgeColor' is 'auto' (default) and 'MarkerFaceColor' is 'auto'.
For a custom color, specify an RGB triplet or a hexadecimal color code.
An RGB triplet is a three-element row vector whose elements specify the intensities of the red, green, and blue components of the color. The intensities must be in the range [0,1], for example, [0.4 0.6 0.7].
A hexadecimal color code is a string scalar or character vector that starts with a hash symbol (#) followed by three or six hexadecimal digits, which can range from 0 to F. The values are not case sensitive. Therefore, the color codes '#FF8800', '#ff8800', '#F80', and '#f80' are equivalent.
Alternatively, you can specify some common colors by name. This table lists the named color options, the equivalent RGB triplets, and hexadecimal color codes.
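|Color Name||Short Name||RGB Triplet||Hexadecimal Color Code|
|'red'||'r'||[1 0 0]||'#FF0000'|
|'green'||'g'||[0 1 0]||'#00FF00'|
|'blue'||'b'||[0 0 1]||'#0000FF'|
|'cyan'||'c'||[0 1 1]||'#00FFFF'|
|'magenta'||'m'||[1 0 1]||'#FF00FF'|
|'yellow'||'y'||[1 1 0]||'#FFFF00'|
|'black'||'k'||[0 0 0]||'#000000'|
|'white'||'w'||[1 1 1]||'#FFFFFF'|
|'none'||Not applicable||Not applicable||Not applicable|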
Here are the RGB triplets and hexadecimal color codes for the default colors MATLAB® uses in many types of plots.
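|RGB Triplet||Hexadecimal Color Code|
|[0 0.4470 0.7410]||'#0072BD'|
|[0.8500 0.3250 0.0980]||'#D95319'|
|[0.9290 0.6940 0.1250]||'#EDB120'|
|[0.4940 0.1840 0.5560]||'#7E2F8E'|
|[0.4660 0.6740 0.1880]||'#77AC30'|
|[0.3010 0.7450 0.9330]||'#4DBEEE'|
|[0.6350 0.0780 0.1840]||'#A2142F'|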
LineWidth — Line width
Line width, specified as the comma-separated pair consisting of 'LineWidth' and a positive value in points. If the line has markers, then the line width also affects the marker edges.
Marker — Marker symbol
'x' | ...
Marker symbol, specified as the comma-separated pair consisting of 'Marker' and one of the values in this table.
|'none'||No markers||Not applicable|
MarkerEdgeColor — Marker outline color
'auto' (default) |
'none' | RGB triplet | hexadecimal color code | color name | short name
Marker outline color, specified as the comma-separated pair consisting of
'MarkerEdgeColor' and an RGB triplet, hexadecimal color code,
color name, or short name for one of the color options listed in the
Color name-value pair argument.
The default value of 'auto' uses the same color specified by using 'Color'.
MarkerFaceColor — Marker fill color
'none' (default) |
'auto' | RGB triplet | hexadecimal color code | color name | short name
Marker fill color, specified as the comma-separated pair consisting of
'MarkerFaceColor' and an RGB triplet, hexadecimal color code,
color name, or short name for one of the color options listed in the
Color name-value pair argument.
The 'auto' value uses the same color specified by using 'MarkerEdgeColor'.
MarkerSize — Marker size
6 (default) | positive value
Marker size, specified as the comma-separated pair consisting of
'MarkerSize' and a positive value in points.
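The name-value arguments described above can be combined in one call. The following is a sketch only: the model and the specific values are illustrative assumptions, and the hexadecimal color code requires a release that accepts that format.

```matlab
% Illustrative model; the choice of predictors is an assumption.
load carsmall
mdl = fitlm([Weight Horsepower],MPG);

% Style the diagnostic data points of a leverage plot.
h = plotDiagnostics(mdl,'leverage', ...
    'Color','#D95319', ...   % hexadecimal color code
    'Marker','o', ...        % circle markers
    'MarkerSize',8, ...      % marker size in points
    'LineWidth',1);          % also affects the marker edges

% Since R2021a, the same options can be written as Name=Value, for example:
% plotDiagnostics(mdl,'leverage',Color='#D95319',Marker='o',MarkerSize=8)
```

The comma-separated form is used in the executable call so that the sketch also runs in releases before R2021a.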
h — Graphics objects
Graphics objects corresponding to the lines or contour in the plot, returned as a graphics array. Use dot notation to query and set properties of the graphics objects. For details, see Line Properties and Contour Properties.
You can use name-value pair arguments to specify the appearance of diagnostic data points corresponding to the first graphics object h(1).
If plottype is 'dfbetas', the plot includes a line object for each coefficient. Name-value pair arguments specify the line object properties of all coefficients. You can modify the properties of each coefficient separately by using the corresponding graphics object.
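Before changing individual objects, it can help to inspect what the returned array contains. This sketch lists each returned object for a 'dfbetas' plot; the model is an illustrative assumption, and the exact number and order of the objects in h are not stated in this description, so the loop only reports them.

```matlab
% Illustrative model; the choice of predictors is an assumption.
load carsmall
mdl = fitlm([Weight Horsepower],MPG);

% Create the dfbetas plot and apply one marker style to all coefficients.
h = plotDiagnostics(mdl,'dfbetas','Marker','.');

% Report the index, display name, and class of each returned object.
for k = 1:numel(h)
    fprintf('%d: %s (%s)\n',k,h(k).DisplayName,class(h(k)));
end

% Show the legend to match the display names to the plotted lines.
legend('show')
```

After identifying the object for a coefficient of interest, set its properties with dot notation, for example h(k).Marker = 'o' for the relevant index k.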
Cook’s distance is the scaled change in fitted values, which is useful for identifying outliers in the X values (observations for predictor variables). Cook’s distance shows the influence of each observation on the fitted response values. An observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.
Each element in the Cook's distance D is the normalized change in the fitted response values due to the deletion of an observation. The Cook’s distance of observation i is
$$D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, \mathrm{MSE}},$$
where
$\hat{y}_j$ is the jth fitted response value.
$\hat{y}_{j(i)}$ is the jth fitted response value, where the fit does not include observation i.
MSE is the mean squared error.
p is the number of coefficients in the regression model.
Cook’s distance is algebraically equivalent to the following expression:
$$D_i = \frac{r_i^2}{p \, \mathrm{MSE}} \left( \frac{h_{ii}}{(1 - h_{ii})^2} \right),$$
where $r_i$ is the ith residual, and $h_{ii}$ is the ith leverage value.
For more details, see Cook’s Distance.
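The residual-and-leverage form of Cook's distance, in which the squared residual divided by p times MSE is multiplied by h/(1-h)^2, can be checked numerically. This is a minimal sketch that assumes the raw residuals are available as mdl.Residuals.Raw and that the model has full rank; the model formula is the one from the example on this page.

```matlab
load carsmall
tbl = table(MPG,Weight);
tbl.Year = categorical(Model_Year);
mdl = fitlm(tbl,'MPG ~ Year + Weight^2');

r = mdl.Residuals.Raw;            % raw residuals
lev = mdl.Diagnostics.Leverage;   % leverage values (diagonal of the hat matrix)
p = mdl.NumCoefficients;          % number of coefficients

% Cook's distance from residuals and leverage.
D = (r.^2 ./ (p*mdl.MSE)) .* (lev ./ (1 - lev).^2);

% Largest absolute difference from the stored values. max ignores NaN
% entries, which occur for rows with missing data.
maxDiff = max(abs(D - mdl.Diagnostics.CooksDistance))
```

A value of maxDiff near machine precision indicates that the two computations agree.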
Delete-1 statistics are useful for finding the influence of each observation. These statistics capture the changes that would result from excluding each observation in turn from the fit. If the delete-1 statistics differ significantly from the model using all observations, then the observation is influential.
See Delete-1 Statistics for the definitions and usages of the delete-1 statistics.
Leverage is a measure of the effect of a particular observation on the regression predictions due to the position of that observation in the space of the inputs.
The leverage of observation i is the value of the ith diagonal term hii of the hat matrix H. The hat matrix H is defined in terms of the data matrix X:
$$H = X (X^T X)^{-1} X^T.$$
The hat matrix is also known as the projection matrix because it projects the vector of observations y onto the vector of predictions $\hat{y}$, thus putting the "hat" on y.
Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered an outlier if its leverage substantially exceeds p/n, where n is the number of observations.
For more details, see Hat Matrix and Leverage.
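The leverage definition can be reproduced from a design matrix. The sketch below is illustrative: the predictors are an assumption, rows with missing values are removed by hand, and the hat matrix diagonal is computed from an economy-size QR factorization, which for a full-rank X gives the same result as the diagonal of X(X'X)^(-1)X' with better numerical behavior.

```matlab
load carsmall

% Keep only complete rows (illustrative choice of predictors).
ok = ~any(isnan([Weight Horsepower MPG]),2);
X = [ones(nnz(ok),1) Weight(ok) Horsepower(ok)];   % design matrix with intercept

% For full-rank X with X = Q*R, the hat matrix is H = Q*Q',
% so its diagonal is the row-wise sum of squares of Q.
[Q,~] = qr(X,0);
lev = sum(Q.^2,2);

% Compare with the leverage values stored in a fitted model.
mdl = fitlm([Weight(ok) Horsepower(ok)],MPG(ok));
maxDiff = max(abs(lev - mdl.Diagnostics.Leverage))

% The leverage values sum to p, the number of coefficients.
sumLeverage = sum(lev)
p = mdl.NumCoefficients
```

Here sumLeverage should match p up to rounding, which is the property that motivates comparing each leverage value with p/n.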
The data cursor displays the values of the selected plot point in a data tip (small text box located next to the data point). The data tip includes the x-axis and y-axis values for the selected point, along with the observation name or number.
Use legend('show') to show the pre-populated legend.
A LinearModel object provides multiple plotting functions.
When creating a model, use plotAdded to understand the effect of adding or removing a predictor variable.
When verifying a model, use plotDiagnostics to find questionable data and to understand the effect of each observation. Also, use plotResiduals to analyze the residuals of the model.
After fitting a model, use plotEffects to understand the effect of a particular predictor. Use plotInteraction to understand the interaction effect between two predictors. Also, use plotSlice to plot slices through the prediction surface.
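As a brief sketch of how some of these functions fit together, the following code creates three of the plots for one model, each in its own figure. The model formula and the 'fitted' residual plot type are illustrative choices.

```matlab
% Illustrative model; the choice of predictors is an assumption.
load carsmall
tbl = table(MPG,Weight,Horsepower);
mdl = fitlm(tbl,'MPG ~ Weight + Horsepower');

% Added variable plot for the model as a whole.
figure
plotAdded(mdl)

% Residuals versus fitted values.
figure
hRes = plotResiduals(mdl,'fitted');

% Main effect of each predictor on the response.
figure
plotEffects(mdl)
```

Opening a new figure before each call keeps the plots from overwriting one another.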
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Introduced in R2012a