# Cox

Create `Cox` model object for lifetime probability of default

## Description

Create and analyze a `Cox` model object to calculate lifetime probability of default (PD) using this workflow:

- Use `fitLifetimePDModel` to create a `Cox` model object.
- Use `predict` to predict the conditional PD and `predictLifetime` to predict the lifetime PD.
- Use `modelDiscrimination` to return AUROC and ROC data. You can plot the results using `modelDiscriminationPlot`.
- Use `modelAccuracy` to return the root mean square error (RMSE) of observed and predicted PD data. You can plot the results using `modelAccuracyPlot`.

## Creation

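### Syntax

`CoxPDModel = fitLifetimePDModel(data,ModelType,'AgeVar',agevar_value)`

`CoxPDModel = fitLifetimePDModel(___,Name,Value)`

### Description

`CoxPDModel = fitLifetimePDModel(data,ModelType,'AgeVar',agevar_value)` creates a `Cox` PD model object.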

If you do not specify variable information for `IDVar`, `LoanVars`, `MacroVars`, and `ResponseVar`, then:

- `IDVar` is set to the first column in the `data` input.
- `LoanVars` is set to include all columns from the second to the second-to-last columns of the `data` input.
- `ResponseVar` is set to the last column in the `data` input.

`CoxPDModel = fitLifetimePDModel(___,Name,Value)` sets optional properties using name-value pair arguments in addition to the required arguments in the previous syntax. For example, the following call creates a `CoxPDModel` using a `Cox` model type. You can specify multiple name-value pair arguments.

```
CoxPDModel = fitLifetimePDModel(data(TrainDataInd,:),"Cox",...
    'ModelID',"Cox_A",'Description',"Cox_model",...
    'AgeVar',"YOB",'IDVar',"ID",'LoanVars',"ScoreGroup",...
    'MacroVars',{'GDP','Market'},'ResponseVar',"Default",'TimeInterval',1)
```

### Input Arguments

`data` — Data

table

Data, specified as a table, in panel data form. The data must contain an `ID` column and an `Age` column. The response variable must be a binary variable with the value `0` or `1`, with `1` indicating default.

**Data Types:** `table`

`ModelType` — Model type

string with value `"Cox"` | character vector with value `'Cox'`

Model type, specified as a string with the value `"Cox"` or a character vector with the value `'Cox'`.

**Data Types:** `char` | `string`

**`Cox` Name-Value Pair Arguments**

Specify required and optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

**Example:**

```
CoxPDModel = fitLifetimePDModel(data(TrainDataInd,:),"Cox",...
    'ModelID',"Cox_A",'Description',"Cox_model",...
    'AgeVar',"YOB",'IDVar',"ID",'LoanVars',"ScoreGroup",...
    'MacroVars',{'GDP','Market'},'ResponseVar',"Default",'TimeInterval',1)
```

**Required `Cox` Name-Value Pair Argument**

`AgeVar` — Age variable indicating which column in `data` contains loan age information

string | character vector

Age variable indicating which column in `data` contains the loan age information, specified as the comma-separated pair consisting of `'AgeVar'` and a string or character vector.

**Note**

The required name-value argument `AgeVar` is not treated as a predictor in the `Cox` lifetime PD model. When using a `Cox` model, you must specify predictor variables using `LoanVars` or `MacroVars`. The `AgeVar` values are the event times for the underlying Cox proportional hazards model.

`AgeVar` values for each ID should be increasing. If there are nonpositive age increments, `fitLifetimePDModel` warns when you create a `Cox` model and removes the IDs with nonpositive age increments. By default, the `TimeInterval` value is set to the most common age increment in the training data.

**Data Types:** `string` | `char`

**Optional `Cox` Name-Value Pair Arguments**

`ModelID` — User-defined model ID

`"Cox"` (default) | string | character vector

User-defined model ID, specified as the comma-separated pair consisting of `'ModelID'` and a string or character vector. The software uses the `ModelID` to format outputs, so it is expected to be short.

**Data Types:** `string` | `char`

`Description` — User-defined description for model

`""` (default) | string | character vector

User-defined description for model, specified as the comma-separated pair consisting of `'Description'` and a string or character vector.

**Data Types:** `string` | `char`

`IDVar` — ID variable indicating which column in `data` contains loan or borrower ID

1st column of `data` (default) | string | character vector

ID variable indicating which column in `data` contains the loan or borrower ID, specified as the comma-separated pair consisting of `'IDVar'` and a string or character vector.

**Data Types:** `string` | `char`

`LoanVars` — Loan variables indicating which columns in `data` contain loan-specific information

all columns of `data` that are not the first or last column (default) | string array | cell array of character vectors

Loan variables indicating which columns in `data` contain the loan-specific information, such as origination score or loan-to-value ratio, specified as the comma-separated pair consisting of `'LoanVars'` and a string array or cell array of character vectors.

**Data Types:** `string` | `cell`

`MacroVars` — Macro variables indicating which columns in `data` contain macroeconomic information

`""` (default) | string array | cell array of character vectors

Macro variables indicating which columns in `data` contain the macroeconomic information, such as gross domestic product (GDP) growth or unemployment rate, specified as the comma-separated pair consisting of `'MacroVars'` and a string array or cell array of character vectors.

**Data Types:** `string` | `cell`

`ResponseVar` — Variable indicating which column in `data` contains response variable

last column of `data` (default) | string | character vector

Variable indicating which column in `data` contains the response variable, specified as the comma-separated pair consisting of `'ResponseVar'` and a string or character vector that names the column.

**Note**

The response variable in the `data` must be a binary variable with `0` or `1` values, with `1` indicating default.

In Cox lifetime PD models, the `ResponseVar` values define the censoring information for the underlying Cox proportional hazards model.

**Data Types:** `string` | `char`

`TimeInterval` — Distance between age values in panel `data` input

set to most common `AgeVar` increment in the training `data` (default) | positive numeric

Distance between age values in the training data in the panel `data` input, specified as the comma-separated pair consisting of `'TimeInterval'` and a positive numeric scalar.

Use the `'TimeInterval'` name-value argument to fit time-dependent models and also as the time interval for the PD computation when you use the `predict` function. For example, if the age data (`AgeVar`) is 1, 2, 3, ..., then the `TimeInterval` is `1`; if the age data is 0.25, 0.5, 0.75, ..., then the `TimeInterval` is `0.25`. For more information, see Time Interval for Cox Models and Lifetime Prediction and Time Interval.

**Note**

Unlike `Logistic` and `Probit` models, a `Cox` model requires an `AgeVar` variable. By default, if you do not specify a `TimeInterval` when creating a `Cox` model, the `TimeInterval` is inferred from the increments in the `AgeVar` values in the training `data`.

**Data Types:** `double`
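The default described above (the most common `AgeVar` increment) can be illustrated with a minimal sketch. This is not the internal implementation of `fitLifetimePDModel`; the toy table and the variable names `ageIncrements`, `timeInterval`, and `hasNonpositive` are hypothetical, and the sketch assumes the rows are sorted by ID and then by age.

```matlab
% Hypothetical toy panel, sorted by ID and then by age (YOB).
ID  = [1;1;1;2;2;2;2];
YOB = [1;2;3;1;2;4;5];
data = table(ID,YOB);

% Age increments between consecutive rows that belong to the same ID.
sameID = diff(data.ID) == 0;
ageIncrements = diff(data.YOB);
ageIncrements = ageIncrements(sameID);

% Most common increment, used here as a stand-in for the default TimeInterval.
timeInterval = mode(ageIncrements);

% Nonpositive increments are the condition that the AgeVar note warns about.
hasNonpositive = any(ageIncrements <= 0);
```

In this toy panel, one loan skips a year, but the most common increment is still the regular yearly step.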

## Properties

`ModelID` — User-defined model ID

`"Cox"` (default) | string

User-defined model ID, returned as a string.

**Data Types:** `string`

`Description` — User-defined description

`""` (default) | string

User-defined description, returned as a string.

**Data Types:** `string`

`IDVar` — ID variable indicating which column in `data` contains loan or borrower ID

1st column of `data` (default) | string

ID variable indicating which column in `data` contains the loan or borrower ID, returned as a string.

**Data Types:** `string`

`AgeVar` — Age variable indicating which column in `data` contains loan age information

string

Age variable indicating which column in `data` contains the loan age information, returned as a string.

**Data Types:** `string`

`LoanVars` — Loan variables indicating which columns in `data` contain loan-specific information

all columns of `data` that are not the first or last column (default) | string array

Loan variables indicating which columns in `data` contain the loan-specific information, returned as a string array.

**Data Types:** `string`

`MacroVars` — Macro variables indicating which columns in `data` contain macroeconomic information

`""` (default) | string array

Macro variables indicating which columns in `data` contain the macroeconomic information, returned as a string array.

**Data Types:** `string`

`ResponseVar` — Variable indicating which column in `data` contains response variable

last column of `data` (default) | string

Variable indicating which column in `data` contains the response variable, returned as a string.

**Data Types:** `string`

`TimeInterval` — Distance between age values in panel `data` input

set to most common `AgeVar` increment in the training data (default) | positive numeric

This property is read-only.

Distance between age values in panel `data` input, returned as a positive numeric scalar.

**Data Types:** `double`

`ExtrapolationFactor` — Extrapolation factor

`1` (default) | positive numeric between `0` and `1`

Extrapolation factor, returned as a positive numeric scalar between `0` and `1`.

By default, the `ExtrapolationFactor` is set to `1`. For age values (`AgeVar`) greater than the maximum age observed in the training data, the conditional PD, computed with `predict`, uses the maximum age observed in the training data. In particular, when the `ExtrapolationFactor` is `1`, the predicted PD value is constant if the predictor values do not change and only the age values change. For more information, see Extrapolation for Cox Models, Extrapolation Factor for Cox Models, and Use Cox Lifetime PD Model to Predict Conditional PD.

**Data Types:** `double`

## Object Functions

| Function | Description |
| --- | --- |
| `predict` | Compute conditional PD |
| `predictLifetime` | Compute cumulative lifetime PD, marginal PD, and survival probability |
| `modelDiscrimination` | Compute AUROC and ROC data |
| `modelAccuracy` | Compute RMSE of predicted and observed PDs on grouped data |
| `modelDiscriminationPlot` | Plot ROC curve |
| `modelAccuracyPlot` | Plot observed default rates compared to predicted PDs on grouped data |

## Examples

### Create Cox Lifetime PD Model

This example shows how to use `fitLifetimePDModel` to create a `Cox` model using credit and macroeconomic data.

**Load Data**

Load the credit portfolio data.

```
load RetailCreditPanelData.mat
disp(head(data))
```

    ID    ScoreGroup    YOB    Default    Year
    __    __________    ___    _______    ____

    1      Low Risk      1        0       1997
    1      Low Risk      2        0       1998
    1      Low Risk      3        0       1999
    1      Low Risk      4        0       2000
    1      Low Risk      5        0       2001
    1      Low Risk      6        0       2002
    1      Low Risk      7        0       2003
    1      Low Risk      8        0       2004

`disp(head(dataMacro))`

    Year     GDP     Market
    ____    _____    ______

    1997     2.72      7.61
    1998     3.57     26.24
    1999     2.86      18.1
    2000     2.43      3.19
    2001     1.26    -10.51
    2002    -0.59    -22.95
    2003     0.63      2.78
    2004     1.85      9.48

Join the two data components into a single data set.

    data = join(data,dataMacro);
    disp(head(data))

    ID    ScoreGroup    YOB    Default    Year     GDP     Market
    __    __________    ___    _______    ____    _____    ______

    1      Low Risk      1        0       1997     2.72      7.61
    1      Low Risk      2        0       1998     3.57     26.24
    1      Low Risk      3        0       1999     2.86      18.1
    1      Low Risk      4        0       2000     2.43      3.19
    1      Low Risk      5        0       2001     1.26    -10.51
    1      Low Risk      6        0       2002    -0.59    -22.95
    1      Low Risk      7        0       2003     0.63      2.78
    1      Low Risk      8        0       2004     1.85      9.48

**Partition Data**

Separate the data into training and test partitions.

    nIDs = max(data.ID);
    uniqueIDs = unique(data.ID);

    rng('default'); % For reproducibility
    c = cvpartition(nIDs,'HoldOut',0.4);

    TrainIDInd = training(c);
    TestIDInd = test(c);

    TrainDataInd = ismember(data.ID,uniqueIDs(TrainIDInd));
    TestDataInd = ismember(data.ID,uniqueIDs(TestIDInd));

**Create a Cox Lifetime PD Model**

Use `fitLifetimePDModel` to create a `Cox` model using the training data.

    pdModel = fitLifetimePDModel(data(TrainDataInd,:),"Cox",...
        'AgeVar','YOB',...
        'IDVar','ID',...
        'LoanVars','ScoreGroup',...
        'MacroVars',{'GDP','Market'},...
        'ResponseVar','Default');
    disp(pdModel)

      Cox with properties:

               TimeInterval: 1
        ExtrapolationFactor: 1
                    ModelID: "Cox"
                Description: ""
                      Model: [1x1 CoxModel]
                      IDVar: "ID"
                     AgeVar: "YOB"
                   LoanVars: "ScoreGroup"
                  MacroVars: ["GDP"    "Market"]
                ResponseVar: "Default"

Display the underlying model.

`disp(pdModel.Model)`

    Cox Proportional Hazards regression model:

                                    Beta          SE         zStat       pValue
                                 __________    _________    _______    ___________

        ScoreGroup_Medium Risk      -0.6794     0.037029    -18.348     3.4442e-75
        ScoreGroup_Low Risk         -1.2442     0.045244    -27.501    1.7116e-166
        GDP                       -0.084533     0.043687     -1.935       0.052995
        Market                   -0.0084411    0.0032221    -2.6198      0.0087991

**Validate Model**

Use `modelDiscrimination` to measure the ranking of customers by PD.

    DataSetChoice = "Testing";
    if DataSetChoice=="Training"
        Ind = TrainDataInd;
    else
        Ind = TestDataInd;
    end
    DiscMeasure = modelDiscrimination(pdModel,data(Ind,:),'SegmentBy','ScoreGroup');
    disp(DiscMeasure)

                                        AUROC
                                       _______

        Cox, ScoreGroup=High Risk      0.64112
        Cox, ScoreGroup=Medium Risk    0.61989
        Cox, ScoreGroup=Low Risk        0.6314

Use `modelDiscriminationPlot` to visualize the ROC curve.

`modelDiscriminationPlot(pdModel,data(Ind,:),'SegmentBy','ScoreGroup')`

Use `modelAccuracy` to measure the accuracy (or calibration) of the predicted PD values. The `modelAccuracy` function requires a grouping variable and compares the observed default rate in each group with the average predicted PD for the group.

    AccMeasure = modelAccuracy(pdModel,data(Ind,:),{'YOB','ScoreGroup'});
    disp(AccMeasure)

                                             RMSE
                                           _________

        Cox, grouped by YOB, ScoreGroup    0.0012471

Use `modelAccuracyPlot` to visualize the observed default rates compared to the predicted PD.

`modelAccuracyPlot(pdModel,data(Ind,:),{'YOB','ScoreGroup'})`

**Predict Conditional and Lifetime PD**

Use the `predict` function to predict conditional PD values. The prediction is a row-by-row prediction.

```
%dataCustomer1 = data(1:8,:);
CondPD = predict(pdModel,data(Ind,:))
```

`CondPD = `*258627×1*
0.0162
0.0091
0.0081
0.0073
0.0064
0.0072
0.0030
0.0016
0.0162
0.0091
⋮

Use `predictLifetime` to predict the lifetime cumulative PD values (computing marginal and survival PD values is also supported).

`LifetimePD = predictLifetime(pdModel,data(Ind,:))`

`LifetimePD = `*258627×1*
0.0162
0.0251
0.0330
0.0400
0.0461
0.0530
0.0559
0.0574
0.0162
0.0251
⋮
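The displayed conditional and lifetime values can be related with a short check. This sketch is not how `predictLifetime` is implemented; it assumes that the first eight displayed rows belong to a single ID in age order, and it uses the identity that cumulative lifetime PD equals one minus the product of the per-period survival probabilities. The variable names are hypothetical.

```matlab
% First eight conditional PD values displayed above (rounded to four decimals).
condPD = [0.0162 0.0091 0.0081 0.0073 0.0064 0.0072 0.0030 0.0016];

% First eight cumulative lifetime PD values displayed above.
displayedLifetimePD = [0.0162 0.0251 0.0330 0.0400 0.0461 0.0530 0.0559 0.0574];

% Survival after each period is the running product of (1 - conditional PD).
survival = cumprod(1 - condPD);

% Cumulative lifetime PD is the complement of survival.
lifetimePD = 1 - survival;

% Largest gap between the recomputed and the displayed values.
maxGap = max(abs(lifetimePD - displayedLifetimePD));
```

Because the inputs are rounded for display, `maxGap` is small but not exactly zero.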

## More About

### Cox Proportional Hazards Models

The *Cox proportional hazards* (PH)
model is a survival model and it models the time until an event of interest
occurs.

For probability of default (PD) models, the event of interest is the default on a credit obligation. `Cox` models need information on whether there was a default and when it happened. For other commonly used PD models, a binary variable indicating whether there was a default is enough. `Cox` PD models need that information, plus the age of the loan at the time of default.

The `Cox` proportional hazards (PH) model, also known as a `Cox` regression model, assumes the hazard rate is of the form

$$h(t;X)={h}_{0}(t)\mathrm{exp}(X\beta )$$

where

- $h_0(t)$ is the baseline hazard rate.
- $X$ is the predictor data.
- $\beta$ is a vector of coefficients of the predictors.
- $\mathrm{exp}(X\beta)$ is the hazard ratio.

The baseline hazard rate is a reference hazard level, common to all observations, and it does not depend on the predictor values. The hazard ratio is the factor that scales the baseline hazard value up or down, depending on the predictor values. For lower risk observations, the hazard ratio is less than `1` and this reduces the hazard rate. For higher risk observations, the hazard ratio increases the hazard rate.
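As a small illustration of the hazard ratio, the following sketch evaluates exp(Xβ) for the three score groups using the `ScoreGroup` coefficients reported in the example on this page. It assumes that High Risk is the reference category (it has no coefficient in the displayed model) and that the macroeconomic predictors are equal across the compared observations, so their contribution cancels. The variable names are hypothetical.

```matlab
% Coefficients from the displayed model: Medium Risk, Low Risk.
beta = [-0.6794; -1.2442];

% Indicator rows for High Risk (assumed reference), Medium Risk, Low Risk.
X = [0 0;
     1 0;
     0 1];

% Hazard ratio of each group relative to the reference group.
hazardRatio = exp(X*beta);

disp(table(["High Risk";"Medium Risk";"Low Risk"],hazardRatio, ...
    'VariableNames',{'ScoreGroup','HazardRatio'}))
```

The negative coefficients give ratios below `1` for the Medium Risk and Low Risk groups, which matches the description of lower risk observations.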

In the hazard rate formula, the predictor values in $X$ are fixed, or *independent of time*. This is the basic version of the `Cox` PH model. For PD models, the basic version of the `Cox` PH model includes predictors that have constant values, such as the origination score, or whether a property is for residential or commercial purposes.

The *time-dependent* `Cox` PH model allows predictor values to change over time. For example, the loan-to-value (LTV) ratio changes over the life of a loan, and the macroeconomic variables change from period to period. Therefore, the following hazard rate formula for time-dependent models includes predictor values that can be a function of time:

$$h(t;X)={h}_{0}(t)\mathrm{exp}(X(t)\beta )$$

The `data` input for `fitLifetimePDModel` must be in panel data form. For each ID (`IDVar`), there are multiple rows of data. The panel `data` input is required for both time-dependent and time-independent models.

For time-independent predictors, the predictor value is constant for each ID. For example, the score at origination for each customer is constant throughout the life of the loan, and this value is repeated for each row corresponding to the same ID in the panel `data` format.

For time-dependent predictors, the values may change from one row to the next for the same ID. The assumption is that the predictor values in each row are valid in the time interval defined by the age value (`AgeVar`) in the previous row and the age value in the current row.
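The panel layout described above can be sketched with a toy table. All column names and values here are hypothetical: `Score` plays the role of a time-independent predictor that is repeated on every row of an ID, and `LTV` plays the role of a time-dependent predictor that changes from row to row.

```matlab
% Hypothetical panel data: two loans, one row per loan per year of age.
ID      = [1;1;1;2;2];
YOB     = [1;2;3;1;2];                  % age column (AgeVar)
Score   = [700;700;700;620;620];        % constant within each ID
LTV     = [0.80;0.78;0.75;0.90;0.88];   % changes within each ID
Default = [0;0;0;0;1];                  % response column (ResponseVar)
panel = table(ID,YOB,Score,LTV,Default);

% Spread of each predictor within an ID: zero means time-independent.
groups = findgroups(panel.ID);
spread = @(x) max(x) - min(x);
scoreSpread = splitapply(spread,panel.Score,groups);
ltvSpread   = splitapply(spread,panel.LTV,groups);

isScoreTimeIndependent = all(scoreSpread == 0);
isLTVTimeDependent     = any(ltvSpread > 0);
```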

### Time Interval for `Cox` Models

Time is discretized into intervals, and predictor values in the training data (`data` input) are constant for each interval: $X_1$ from $t_0$ to $t_1$; $X_2$ from $t_1$ to $t_2$; and so forth.

The `data` input must be in panel data form, with multiple observations for each ID, with corresponding age information (the $t_k$ values, the `AgeVar` column) and the corresponding default indicator values (the `ResponseVar` column).

Assume that $t_k - t_{k-1} = \Delta t$ for all $k$ and this is the *time interval*. This time interval is the age increment for consecutive observations in the age data (`AgeVar`). The assumption is that these increments are regular and that the default indicator (`ResponseVar`) is defined consistently with this time interval, in the sense that a `1` means there was a default in a time interval of length $\Delta t$. The time interval $\Delta t$ is also used for the computation of the probability of default. For more information, see Lifetime Prediction and Time Interval.

### Survival and Probability of Default for `Cox` Models

The survival function $S(t)$ is a function of time, and gives the probability of surviving longer than a given time $t$.

$$S(t)=P(T>t)$$

where

- $T$ is the failure time, the random variable of interest, and in the `Cox` model case, the time to default.
- $t$ is the specific time of interest, for example, 1 year.

The main relationship between the survival function and the hazard rate is

$$S(t)=\mathrm{exp}\left(-\int_{0}^{t}h(u)\,du\right)$$

Higher values of the hazard rate cause the survival probability to drop faster. Conversely, lower values of the hazard rate cause the survival probability to drop more slowly. The survival probability never rises over time.
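To make the relationship between hazard rate and survival concrete, the following sketch evaluates the integral for two hypothetical hazard rates that are constant over each time interval, so that the integral reduces to a cumulative sum. The hazard values and variable names are assumptions chosen for illustration.

```matlab
% Time grid with interval length dt (hypothetical yearly steps).
dt = 1;
t = (1:5)*dt;

% Two hypothetical hazard rates, each constant over every interval.
hLow  = 0.01*ones(size(t));
hHigh = 0.03*ones(size(t));

% S(t) = exp(-integral of h); for piecewise-constant h this is a cumulative sum.
SLow  = exp(-cumsum(hLow*dt));
SHigh = exp(-cumsum(hHigh*dt));

disp(table(t',SLow',SHigh','VariableNames',{'t','S_lowHazard','S_highHazard'}))
```

Both survival curves decrease with time, and the curve for the higher hazard rate lies below the other at every time point.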

The probability of default (PD) is the conditional probability of defaulting
in a time interval, given that there has been no default prior to that interval.
For example, the probability of default between time *s* and
*t*, with *s* < *t*,
is represented as:

$$\begin{aligned}PD(s,t)&=P(s<T\le t\mid T>s)\\ &=\frac{S(s)-S(t)}{S(s)}\\ &=1-\frac{S(t)}{S(s)}\end{aligned}$$

In credit applications, the time interval of interest, Δ*t*,
is consistent with the training data and the definition of default in the
response variable. The PD is a function of a single time variable
*t* and the implicit time interval Δ*t*:

$$PD(t)=1-\frac{S(t)}{S(t-\Delta t)}$$
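The formula for PD(t) can be evaluated numerically with a minimal sketch. It assumes a hypothetical hazard rate that is constant over each interval of length Δt and takes S(0) = 1; under those assumptions the conditional PD of interval k reduces to 1 - exp(-h_k Δt). The hazard values and variable names are illustrative only.

```matlab
% Hypothetical hazard rate for four consecutive intervals of length dt.
dt = 1;
h = [0.020 0.015 0.010 0.010];

% Survival at the end of each interval, S(t_k).
S = exp(-cumsum(h*dt));

% Survival at the start of each interval, S(t_k - dt), with S(0) = 1.
SPrev = [1 S(1:end-1)];

% Conditional PD for each interval: PD(t) = 1 - S(t)/S(t - dt).
PD = 1 - S./SPrev;

% Closed form for a piecewise-constant hazard, used as a cross-check.
PDClosedForm = 1 - exp(-h*dt);
```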


## See Also

### Topics

- Basic Lifetime PD Model Validation
- Compare Logistic Model for Lifetime PD to Champion Model
- Compare Lifetime PD Models Using Cross-Validation
- Expected Credit Loss Computation
- Compare Model Discrimination and Accuracy to Validate of Probability of Default
- Compare Probability of Default Using Through-the-Cycle and Point-in-Time Models
- Modeling Probabilities of Default with Cox Proportional Hazards
- Overview of Lifetime Probability of Default Models

**Introduced in R2021b**
