Main Content

Forecast Mortality Trends Using Lee-Carter Model

This example shows how to forecast trends in mortality using the Lee-Carter method with an ARIMA model.

The modeling of mortality and life expectancy are major research areas in life science, demography, and insurance pricing [1,2]. In the last century, because of improvements in life expectancy, financial exposures have increased for the life insurance and pension industries. In the actuarial literature, this exposure is known as longevity risk. To mitigate this risk, it is necessary to build projected life tables to predict the trend in life expectancy. Consequently, several actuarial models have been developed for better pricing of insurance premiums.

In 1992, Lee and Carter introduced the first mortality model with stochastic forecasting. The Lee-Carter (LC) model expresses the logarithm of mortality rates as the sum of age-specific mortality and time-varying parameters. The LC model is a benchmark for census bureaus and it is a leading statistical model for mortality forecasting.

This example implements the LC model and demonstrates two tools for estimating the model parameters. The example uses singular value decomposition (SVD) and univariate ARIMA to solve for and forecast the time-varying parameter.

Load and Visualize Data

Load the mortality data. This example uses United States mortality data between 1997 and 2020 from the Centers for Disease Control and Prevention (CDC). The data records the yearly mortality rate for the male, female, and combined U.S. populations. The mortality rate is computed for one-year age groups, starting with 0–1 and ending with 99–100. There is also an additional age group, 100+, that always has a mortality rate of 1 and this is discarded from the analysis.

For visualization purposes, combine the ages into a smaller number of groups, but for the implementation of the LC algorithm, you use the one-year age categories.

load CDCLifeTables.mat
Data = USAllLifeTable;

% Compute mortality per age groups
MortalityRates = Data.Variables;
Years = Data.Properties.VariableNames;
Years = categorical(Years,Years,Ordinal=true);
Labels = {'<1','[1,5)','[5,15)','[15,25)','[25,35)','[35,55)','[55,75)','>=75'};
Labels = categorical(Labels,Labels,"Ordinal",true);
AgeLimits = [0, 1, 5, 15, 25, 35, 55, 75, 100];

% Initialize averaged mortality rate matrix
AverageMortalityRates = zeros(length(Labels), length(Years));

% Loop over each year
for i = 1:length(Years)
    % Calculate mortality rate for each age group
    for j = 2:length(AgeLimits)
        % Find rows corresponding to age group
        AgeRows = Data.Properties.RowNames((AgeLimits(j-1)+1):AgeLimits(j));
        
        % Calculate mortality rate
        AverageMortalityRates(j-1, i) = mean(Data{AgeRows,i});
    end
end

t = tiledlayout(2,1,'TileSpacing','compact');
nexttile
% Plot 0-1 age group over all years
plot(Years,100*AverageMortalityRates(1,:),'-*')
hold on
% Plot 25-35 age group over all years
plot(Years,100*AverageMortalityRates(5,:),'-s')
xlabel('Year')
legend({'Below 1 year','25-35 years'},'Location',"best")

nexttile
% Plot all age groups for 1997
plot(Labels,100*AverageMortalityRates(:,1),'-*')
hold on
% Plot all age groups for 2020
plot(Labels,100*AverageMortalityRates(:,24),'-s')
hold off
xlabel('Age Groups')
title(t,'US Mortality Rates (%)')
legend({'1997','2020'},'Location','best')

For a given age group, the trend shows that the mortality rate decreases with time because of improvements in health care, lifestyle, and so on. However, the trend slows, and even reverses, for some age groups starting around 2015. The rates are generally higher for older age groups, with an exception for infants in their first year, who have higher mortality compared to most other groups. 2020 is an unusual year, presumably due to the effects of SARS-CoV-2.

For a fixed year, 1997 or 2020, the trend shows a parabolic shape with high mortality rates for older age groups (over 75 years) and a somewhat higher mortality rates for infants in their first year.

Build Lee-Carter Model

Equation

The Lee-Carter model stipulates that the logarithm of the death rate for time t and age group x, mx,t, is the sum of three terms. The first term is an age-specific component, ax, which is independent of time and can be interpreted as an average age profile of mortality. The second term is the product of two components, kt and bx. kt is a time-dependent parameter that serves as a one-dimensional summary of the mortality rate across all age groups during year t. More specifically, kt is the (rescaled) first principal component for the age profile of mortality at time t. bx is an age-specific coefficient representing the sensitivity of age group x to changes in kt. An error term ϵx,t is the final component in the LC model. The general equation is:

ln(mx,t)=ax+bxkt+ϵx,t             (1)

To guarantee a unique solution, the LC model adds the following two constraints on the coefficients bx and kt:

xbx=1       (2.1) and tkt=0       (2.2)

These two conditions are such that the coefficient ax is the average log mortality rate for a given age.

Interpretation

  • Vector a is the average age profile of mortality.

  • Vector k tracks the mortality changes over time.

  • Vector b determines how much each age group changes when k changes.

When k is linear in time, each age group changes according to its specific exponential rate.

The terms on the right-hand side of equation (1) are non-observable. As such, it is not possible to perform ordinary least-squares methods to fit the model. Instead, the LC model follows a two-stage estimation approach:

  1. Estimate the parameters using SVD.

  2. Forecast the time series kt using an ARIMA model.

Estimation Using SVD

Historically, LC estimated the parameter vectors a,b, and k using U.S. mortality data from 1933 to 1987. The estimation is based on a least-squares approach. Namely, the vector a is estimated by taking the average of the log rates over time

1Ttln(mx,t)  =1Ttax+bx1Ttkt+1Ttϵx,t

Given equation (2.2) and the randomness of the error term ϵx,t, you can get an estimator for ax:

axˆ=1Ttln(mx,t)              (3)

Using Singular Value Decomposition (SVD) on the residuals, you can solve for the parameter column vectors b and k. Given a time span of size T (t=0,1,2,...,T-1) and N age groups (x=1,2,...,N), the N-by-T residual matrix Z, with age groups spanning rows and time spanning the columns, is defined as:

Z=ln(mx,t)-axˆ=

[ln(mx1,t0)-a1ˆln(mx0,t1)-a1ˆln(mx0,t2)-a1ˆ...ln(mx0,tT-1)-a1ˆln(mx2,t0)-a2ˆln(mx2,t1)-a2ˆln(mx2,t2)-a2ˆ...ln(mx2,tT-1)-a2ˆln(mx3,t0)-a3ˆln(mx3,t1)-a3ˆln(mx3,t2)-a3ˆ...ln(mx3,tT-1)-a3ˆ...............ln(mxN,t0)-aNˆln(mxN,t1)-aNˆln(mxN,t2)-aNˆ...ln(mxN,tT-1)-aNˆ]    (4)

The values of mx,t and axˆare known for each age group and time. You can rewrite the matrix Z as follows to solve for the unknowns bˆ and kˆ:

Z=ln(mx,t)-axˆ=U.S.V*=bˆ*kˆ= i=1kλiPx,iQt,i        (5),

where k=rank(Z),λi(i=1,2,...,k) are the singular values of Z in decreasing order, and Px,i and Qt,i(i=1,2,...,k) are the corresponding left and right singular vectors. The first term approximation of the SVD gives:

bˆ=Px,1and        kˆ=λ1Qt,1      (6)

MATLAB® uses a LAPACK-based implementation of the SVD. With this implementation, the first singular vectors from an all-positive matrix that will always have all-negative elements. In this example, you must flip the sign for the solutions bxˆand ktˆ; otherwise, the trend is opposite to the observed and expected data [3].

% Initialize the Mortality matrix
Mxt0 = Data.Variables;

% Remove the age group '100+', which always has a mortality rate of 1.
Mxt0 = Mxt0(1:end-1,:);

% Remove year 2020 from the time series.
Mxt0 = Mxt0(:,1:end-1);

% Define T, the number of years
T = size(Mxt0,2); 

% Calculate the estimator for the age-specific pattern of mortality coefficient, a_hat
a_hat = (1/T)*sum(log(Mxt0),2); % sum(A,2) is column vector containing the sum of each row.

% Define matrix Z
Z = log(Mxt0)-a_hat;

% Apply SVD to Z
[U,S,V] = svd(Z);

% Solve for estimates b_hat and k_hat. 
b_hat = -U(:,1);
sum_b_hat = sum(b_hat);
b_hat = b_hat/sum_b_hat;
k_hat = -sum_b_hat*S(1,1)*V(:,1);

% Plot the parameters
tiledlayout(2,2,'TileSpacing','compact');
nexttile([1 2])
plot(Years(1:end-1),k_hat)
xlabel('Years')
ylabel('k_t')
title('US Mortality Index')

nexttile
plot(a_hat)
xlabel('Age Groups')
title('Estimate $$\hat{a}_x$$','Interpreter','latex')

nexttile
plot(b_hat)
xlabel('Age Groups')
title('Estimate $$\hat{b}_x$$','Interpreter','latex')

With the coefficients axˆ and bxˆ fitted, you only need to forecast the mortality trend ktˆ for future mortality estimates. There are several forecasting methods available. This example uses an ARIMA(p,d,q) model.

Forecast Using ARIMA Models

Motivation

Predicting mortality trends with the LC model requires forecasting kt over time, usually as a stochastic process. kt is a univariate time series, and, if the requisite assumptions are met, ARIMA(p,d,q) models are effective at forecasting such a series. The fundamental assumption of an ARIMA(p,d,q) model is that past values of a given time series contain information about the future values of the time series. More specifically, assume that the series is modeled using a combination of autoregression (AR) and moving average (MA) terms. AR terms model the linear dependency of kt on its past values kt-i,i=1,2,..., while MA terms model the linear dependency of kt on past values of a white-noise process.

ARMA models assume that the given time series is stationary, which may not be correct. When a given series is not stationary, you can difference the terms of the series and test the resulting series for stationarity, then repeat as necessary until you obtain a stationary time series. This is where the integrated (I) term of the ARIMA(p,d,q) model comes from, as a time series that has been differenced d times and is an integral of order d.

The general expression for an ARIMA(p,d,q) model is

Δdkt=c+α1Δdkt-1+...+αpΔdkt-p+ϵt+θ1ϵt-1+....+θqϵt-q     (7),

where Δ is the difference operator. Fitting an ARIMA(p,d,q) model requires determining values for p, d, and q. You can do this using the Box-Jenkins methodology (see Select ARIMA Model for Time Series Using Box-Jenkins Methodology (Econometrics Toolbox)) as a guide:

  1. Preliminary analysis of the time series ktˆ to determine the order of integration d

  2. Identification of the orders (p,q) using the autocorrelation function (ACF) and partial ACF (PACF) outputs

  3. Parameter estimation using the ARIMA estimate (Econometrics Toolbox) method

  4. Using the residuals of the fitted model to check the goodness of fit

  5. Forecastingfuture time series values, in this case against the 2020 pandemic year

Formulation

Preliminary Analysis of ktˆ

Plot ktˆ, as well as Δktˆ and Δ2ktˆ.

% Plot the mortality index first order difference.
figure
tiledlayout(3,1,'TileSpacing','compact');
nexttile
plot(k_hat)
xlabel('Years')
ylabel('k_t')
title('US Mortality Index')

% Plot the first order difference.
nexttile
k_hat_Diff = diff(k_hat); 
plot(k_hat_Diff)
xlabel('Years')
ylabel('\Delta k_t')
title('US Mortality Index First Order Difference')

% Plot the second order difference.
nexttile
k_hat_Diff_2 = diff(k_hat_Diff); 
plot(k_hat_Diff_2)
xlabel('Years')
ylabel('\Delta^2 k_t')
title('US Mortality Index Second Order Difference')

ktˆ follows a linearly decreasing trend for at least part of the time series, and therefore cannot be assumed stationary. However, you can check if that the time series is trend stationary, so you can perform the Augmented Dickey-Fuller (ADF) test using adftest (Econometrics Toolbox) with the 'Model' selection set to Trend Stationary ('TS'). Test lags 0–2 because 3 and 4 are numerically difficult, and because longer lags are generally less common. Also, there are relatively few data points, implying a strong preference for a parsimonious model. Lastly, if the ADF test returns a positive result for lags 3 and 4, but not 0–2, you cannot assume trend stationarity.

adftest(k_hat,'Model','TS','Lags',[0,1,2])
ans = 1x3 logical array

   0   0   0

The ADF test results are suggestive, but often you want to confirm stationarity results using multiple test methods. A natural pairing for the ADF test is the KPSS test (kpsstest (Econometrics Toolbox)), which switches the null and alternative hypotheses. That is, with the ADF test the null hypothesis is the unit root result, and the alternative hypothesis is the stationary result. With the KPSS test the null hypothesis is the stationary result and the alternative hypothesis is the unit root.

kpsstest(k_hat,'Lags',[0,1,2])
ans = 1x3 logical array

   1   1   1

The KPSS test rejects the stationarity hypothesis for all lags. As expected, you cannot safely assume the raw time series is stationary or trend stationary, so you need to difference the series.

Δktˆ does not have an obvious trend, and therefore may be stationary. Performing an ADF test on the differenced time series, this time with 'Model' set to 'AR', shows mixed results.

adftest(k_hat_Diff,'Model','AR','Lags',[0,1,2])
ans = 1x3 logical array

   1   0   0

With a lag of 0, the ADF test rejects the unit root null hypothesis. However, the ADF test fails to reject the unit root for lags of 1 and 2 (and as demonstrated later, the second order differences do have statistically significant lags). Cross-check the results against the KPSS test.

kpsstest(k_hat_Diff,'Lags',[0,1,2])
ans = 1x3 logical array

   1   1   0

The KPSS test rejects the stationary hypothesis for a lag of 0. The differenced time series might be stationary, but you cannot assume this. Therefore, you must difference the series once more.

Performing the ADF test on the second order time series shows that the series is likely stationary under a model with any of the three lag inputs:

adftest(k_hat_Diff_2,'Model','AR','Lags',[0,1,2])
ans = 1x3 logical array

   1   1   1

Cross-check against the KPSS test:

kpsstest(k_hat_Diff_2,'Lags',[0,1,2])
ans = 1x3 logical array

   0   0   0

The unit root hypothesis is rejected and the stationary hypothesis is not rejected across all lags. Thus, you can confidently fit an ARMA(p,q) model on the second order differences. Equivalently, the ARIMA(p,d,q) model has d = 2.

Identification of the Orders (p,q)

To find the orders of p and q for the ARMA model, you must continue on with the Box-Jenkins methodology (see see Select ARIMA Model for Time Series Using Box-Jenkins Methodology (Econometrics Toolbox)) and inspect the output of the ACF and PACF methods.

The ACF output for Δ2ktˆ shows a single significant lag, and, importantly, shows that the autocorrelations decline gradually:

figure
autocorr(k_hat_Diff_2)
title('Autocorrelation Function of Second Order Difference')

You can also inspect the PACF output:

figure
parcorr(k_hat_Diff_2,'Method','Yule-Walker')
title('Partial Autocorrelation Function of Second Order Difference')

The first partial autocorrelation is significant, after which the correlations drop off immediately. Taken together, these observations suggest an AR(1) model is appropriate where p = 1 and q = 0. A parsimonious model such as this is desirable, given the limited data.

Parameter Estimation

To estimate the model parameters, first specify the arima (Econometrics Toolbox) model that you want to use, then usel estimate (Econometrics Toolbox) with the time series data as an input. You can pass estimate (Econometrics Toolbox) the original ktˆ, as you set d = 2 in the model specification, and therefore estimate (Econometrics Toolbox) looks at the second-order differences.

% Create the ARIMA model
Mdl = arima(1,2,0);
Mdl = estimate(Mdl,k_hat);
 
    ARIMA(1,2,0) Model (Gaussian Distribution):
 
                 Value      StandardError    TStatistic      PValue  
                ________    _____________    __________    __________

    Constant     0.04163       0.16439        0.25323         0.80009
    AR{1}       -0.71182       0.19871        -3.5822      0.00034065
    Variance     0.53307       0.19135         2.7858       0.0053396

Therefore, the model representing the mortality index is

Δ2kt=0.04-0.71Δ2kt-1+ ϵt.

Plot the original time series with the fitted ARIMA model, as well as the residuals to further confirm that the chosen model fits the original time series.

% Compute the residuals
Res = infer(Mdl,k_hat);

% Plot time series vs ARIMA fit
figure
tiledlayout(2,1,'TileSpacing','compact');
nexttile
plot(k_hat);
hold on
% The ARIMA fit is simply the original time series minus the residuals
plot(k_hat-Res,'-')
legend({'k_t','ARIMA Fit'})
hold off
xlabel('Years')
ylabel('Mortality Index')
title('Model Fit')

nexttile
% Plot the residuals
plot(Res)
hold on
% Plot x axis (residual = 0 value)
plot([1,25],[0,0],'-')
hold off
xlabel('Years')
title('Residual Plot')

Check Goodness of Fit

Plot the standardized residuals and autocorrelations and make a QQ plot to check that the residuals.

% Plot the standardized residuals, qqplot, and correlations
figure
t = tiledlayout(2,2,'TileSpacing','compact');
nexttile
plot(Res./sqrt(Mdl.Variance))
title('Standardized Residuals')
nexttile
qqplot(Res)
nexttile
autocorr(Res)
nexttile
parcorr(Res,'Method','Yule-Walker')

% Adjust text sizes
hvec = findall(gcf,'Type','axes');
set(hvec,'TitleFontSizeMultiplier',0.8,...
    'LabelFontSizeMultiplier',0.8);

The QQ plot shows a reasonably normal distribution for the residuals. Similarly, both the ACF and PACF plots have no statistically significant correlated lags, although both outputs show a large correlation on the seventh lag that is just below significance. Overall, the residuals suggest that the fitted model is reasonable, at least given the small data set.

Another validation option is forecasting out-of-sample data points. Ideally, you would remove five data points from the end of the time series for validation purposes. However, this data set only contains 24 time periods, two of which are lost from differencing. If you removed another five, the fitted model would be based on a very small sample size. Additionally, the final year in the time series is 2020, which, due to the SARS-COV-2 pandemic, is likely to be a tail event in the mortality series. This makes 2020 a problematic year to use as an out-of-sample validation check. As such, you cannot validate this model by forecasting against validation data, and you need to take a closer look at 2020.

Model Forecast and Pandemic

You remove the year 2020 from the data set before fitting the ARIMA model, as it is likely an unusual year for mortality data. That 2020 is an outlier does not necessarily imply that the data point should be removed, however, as with a large data set outliers provide information about possible tail events. With a small data set, however, an unusual event like the SARS-CoV-2 pandemic is likely to skew the fitted parameters, and therefore you can reasonably remove 2020 from the time series. So, you have one (unusual) data point that you can forecast against. The following comparison should not be considered a form of validation, but instead it is a brief investigation into the impact that SARS-COV-2 had on annual mortality.

% Find k_hat for 2020
Y2020 = Data.Variables;
Y2020 = Y2020(:,end);

% Remove the age group '100+', which always has a mortality rate of 1.
Y2020 = Y2020(1:end-1);

% Transform Y2020
Y2020 = log(Y2020) - a_hat;

% To get the first principle component we must project our new data point
% onto the first principle direction: <v_new,v_basis>/<v_basis,v_basis>
Top = Y2020'*b_hat;
Bottom = b_hat'*b_hat;
k_hat2020 = Top/Bottom;

% Model forecast
Tf = 1;
[Y,YMSE] = forecast(Mdl,Tf,k_hat);

% Lower and upper 95% confidence bounds
YLower = Y-1.96*sqrt(YMSE);
YUpper = Y+1.96*sqrt(YMSE);

% Plot time series and forecast
figure
plot(Years(1:end-1),k_hat)
hold on
h1 = plot(Years(end-1:end),[k_hat(end);YLower],'r-.');
plot(Years(23:end),[k_hat(end);YUpper],'r-.')
h2 = plot(Years(end-1:end),[k_hat(end);Y],'k--');
h3 = plot(Years(end),k_hat2020,'bo');
hold off
xlabel('Years')
ylabel('k_t')
title('30-Year Mortality Index Forecast')
legend([h1 h2 h3],{'95% confidence interval','Forecast','2020'},'Location',"best")

The observed k_hat value for 2020 does appear to be an outlier. To understand what this actually means for mortality, you must translate this back into the multi-dimensional age profile:

% Forecasted mortality
Exponent = a_hat + b_hat*Y;
mxt = exp(Exponent);

% Estimated mortality using the observed k_hat for 2020
Exponent2020 = a_hat + b_hat*k_hat2020;
mxt2020 = exp(Exponent2020);

% Adjust MinAge and MaxAge to see the predicted VS observed mortality
% rates over a specific age range
MinAge = 20;
MaxAge = 65;

figure
% Plot observed mortality rates
plot(MinAge:MaxAge,100*MortalityRates(MinAge:MaxAge,24),'-*')
hold on
% Plot forecasted and estimated mortality rates
plot(MinAge:MaxAge,100*mxt(MinAge:MaxAge),'k-s');
plot(MinAge:MaxAge,100*mxt2020(MinAge:MaxAge),'r-o');
hold off
xlabel('Age Group')
ylabel('Mortality Rate')
title('US Mortality Rates (%)')
legend({'2020 observed mortality rate','2020 mortality derived from forecasted k_t','2020 mortality derived from observed k_t'},'Location','best')

As expected, given the known age profile of SARS-COV-2 mortality, the pandemic appears to have had a larger impact on the elderly, relative to younger age groups. Younger age categories were also impacted, however, which can be further investigated by adjusting the slider values. Therefore, 2020 was not only an unusual year for overall mortality, but the age profile of mortality rates were also affected.

References

  1. Haberman, S. and M. Russolillo. Lee-Carter Mortality Forecasting: Application to the Italian Population. Actuarial Research Paper No. 167 (London: Cass Business School, November 2005).

  2. Chavhan R. N. and R. L. Shinde. "Modeling and Forecasting Mortality Using the Lee-Carter Model for Indian Population Based on Decade-Wise Data." Sri Lankan Journal of Applied Statistics. Vol. 17, No. 1, 2016.

  3. Bro, R., E. Acar, and T. Kolda. Resolving the Sign Ambiguity in the Singular Value Decomposition. Sandia Report: SAND2007-6422 (United States. Dept. of Energy. Office of Scientific and Technical Information, October 2007).

See Also

| |

Related Examples

More About