## Feature Screening with `screenpredictors`

This example shows how to perform predictor screening using `screenpredictors`. Predictor screening is a type of univariate analysis performed as an early step in the Credit Scorecard Modeling Workflow (Financial Toolbox). Predictor screening is an important preprocessing step when you work with credit scorecards, as data sets can be prohibitively large and have dozens or hundreds of potential predictors.

The goal of screening predictors is to pare down the set of predictors to a subset that is more useful in predicting the response variable based on the calculated metrics. Screening enables you to select the top predictors as ranked by a given metric to train your credit scorecards.

### Load Data

The credit card data table contains a customer ID (`CustID`), nine predictors, and the response variable (`status`). Some of the risk factors are more useful in predicting the probability of a loan default, whereas others are less useful. The screening process helps you select the best subset of predictors.

Although the data set in this example contains only a few predictors, in practice, credit scorecard data sets can be very large. The predictor screening process is important as data sets grow to contain dozens or hundreds of predictors.

```
% Load credit card data tables.
load CreditCardData

% Use the dataMissing data set, which contains some missing values.
data = dataMissing;

% Identify the ID and response variables.
idvar = 'CustID';
responsevar = 'status';

% Examine the structure of the table.
disp(head(data));
```
```
    CustID    CustAge    TmAtAddress    ResStatus      EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

    1         53         62             <undefined>    Unknown      50000         55         Yes        1055.9       0.22        0
    2         61         22             Home Owner     Employed     52000         25         Yes        1161.6       0.24        0
    3         47         30             Tenant         Employed     37000         61         No         877.23       0.29        0
    4         NaN        75             Home Owner     Employed     53000         20         Yes        157.37       0.08        0
    5         68         56             Home Owner     Employed     53000         14         Yes        561.84       0.11        0
    6         65         13             Home Owner     Employed     48000         59         Yes        968.18       0.15        0
    7         34         32             Home Owner     Unknown      32000         26         Yes        717.82       0.02        1
    8         50         57             Other          Employed     51000         33         No         3041.2       0.13        0
```

### Add Additional Derived Predictors

Derived predictors, such as the ratio of two predictors or a transformation of a predictor x (for example, x^2 or log(x)), can often capture additional information or produce better metric results. To demonstrate this, create a few derived predictors and add them to the data set.

```
data.BalanceUtilRatio = data.AMBalance ./ data.UtilRate;
data.BalanceIncomeRatio = data.AMBalance ./ data.CustIncome;
```
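One caveat with ratio predictors: under MATLAB floating-point division, a zero denominator produces `Inf` or `NaN`. The sketch below uses small made-up vectors rather than `CreditCardData` and shows one way to flag such rows before screening. The names `balance`, `utilRate`, and `nonFinite` are illustrative only, and how `screenpredictors` treats non-finite values is not covered in this example.

```matlab
% Hypothetical toy values (not taken from CreditCardData).
balance  = [1055.9; 157.37; 0; 717.82];
utilRate = [0.22;   0.08;   0; 0];

% Elementwise division: x/0 gives Inf for x > 0, and 0/0 gives NaN.
ratio = balance ./ utilRate;

% Flag rows where the derived predictor is Inf or NaN.
nonFinite = ~isfinite(ratio);
fprintf('%d of %d derived values are non-finite.\n', sum(nonFinite), numel(ratio));
```

If such rows turn up in real data, whether to treat them as missing, cap them, or drop the derived predictor is a modeling decision.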

### Compute Metrics

Use `screenpredictors` to compute several measures of risk factor predictiveness. The columns of the output table contain the metric values for each predictor. The table is sorted by information value.

`T = screenpredictors(data,'IDVar',idvar,'ResponseVar',responsevar)`
```
T=11×7 table
                          InfoValue    AccuracyRatio    AUROC      Entropy    Gini       Chi2PValue    PercentMissing
                          _________    _____________    _______    _______    _______    __________    ______________

    CustAge               0.17698      0.1672           0.5836     0.88795    0.42645    0.0020599     0.025
    TmWBank               0.15719      0.13612          0.56806    0.89167    0.42864    0.0054591     0
    CustIncome            0.15572      0.17758          0.58879    0.891      0.42731    0.0018428     0
    BalanceIncomeRatio    0.097073     0.1278           0.5639     0.90024    0.43303    0.11966       0
    TmAtAddress           0.094574     0.010421         0.50521    0.90089    0.43377    0.182         0
    UtilRate              0.075086     0.035914         0.51796    0.90405    0.43575    0.45546       0
    AMBalance             0.07159      0.087142         0.54357    0.90446    0.43592    0.48528       0
    BalanceUtilRatio      0.068955     0.026538         0.51327    0.90486    0.43614    0.52517       0
    EmpStatus             0.048038     0.10886          0.55443    0.90814    0.4381     0.00037823    0
    OtherCC               0.014301     0.044459         0.52223    0.91347    0.44132    0.047616      0
    ResStatus             0.0095558    0.049855         0.52493    0.91446    0.44198    0.29879       0.033333
```
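The `AccuracyRatio` and `AUROC` columns carry the same ranking information, because the two are related by AR = 2·AUROC − 1. The following check is a sketch using values copied by hand from the first three rows of `T`. It confirms the relationship to within display rounding. The variable names are illustrative choices.

```matlab
% Values copied from the displayed table (CustAge, TmWBank, CustIncome).
auroc = [0.5836; 0.56806; 0.58879];
ar    = [0.1672; 0.13612; 0.17758];

% Accuracy ratio implied by AUROC.
arFromAuroc = 2*auroc - 1;

% Largest discrepancy between the implied and the tabulated accuracy ratio.
maxDiff = max(abs(arFromAuroc - ar));
fprintf('Largest |AR - (2*AUROC - 1)| = %.2g\n', maxDiff);
```

It follows that an accuracy-ratio threshold of a selects the same predictors as an AUROC threshold of (a + 1)/2, so thresholding on both adds no information.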

### Threshold Metrics

Set thresholds for the predictors based on several metrics. For each metric, adjust the threshold sliders to set the range of passing values. In the plot, green bars indicate predictors that pass the threshold. Red bars indicate predictors that do not pass the threshold. You can omit predictors that do not "pass" the threshold from the final data set.
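The direction of the comparison depends on the metric. Larger is better for `InfoValue`, `AccuracyRatio`, and `AUROC`, whereas smaller is better for `Entropy`, `Gini`, `Chi2PValue`, and `PercentMissing`. This is how the local function `thresholdPredictor` at the end of this example treats them. As a sketch of the smaller-is-better case, the snippet below applies an upper bound to the `Chi2PValue` column, with values copied by hand from the table above. The 0.05 cutoff is an assumption for illustration, not a value used in this example.

```matlab
% Predictor names and Chi2PValue entries copied from the metrics table.
names = {'CustAge'; 'TmWBank'; 'CustIncome'; 'BalanceIncomeRatio'; ...
         'TmAtAddress'; 'UtilRate'; 'AMBalance'; 'BalanceUtilRatio'; ...
         'EmpStatus'; 'OtherCC'; 'ResStatus'};
chi2p = [0.0020599; 0.0054591; 0.0018428; 0.11966; 0.182; 0.45546; ...
         0.48528; 0.52517; 0.00037823; 0.047616; 0.29879];

% Assumed cutoff: a predictor passes when its p-value is at or below it.
pThresh = 0.05;
passed = chi2p <= pThresh;

% List the predictors that pass.
passingNames = names(passed);
disp(passingNames)
```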

First, select predictors based on their information value.

`infovalueThresh = 0.08;`
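For intuition about what this threshold is compared against, information value for a binned predictor is commonly defined as the sum over bins of (pGood − pBad)·ln(pGood/pBad). Here pGood and pBad are the shares of all goods and all bads that fall in each bin. The sketch below evaluates that formula on made-up bin counts. The counts and variable names are hypothetical. Because the binning that `screenpredictors` applies is not described here, the result is not expected to reproduce any value in `T`.

```matlab
% Hypothetical counts of good and bad observations in three bins.
nGood = [100; 150; 250];
nBad  = [ 60;  50;  40];

% Share of all goods and of all bads that fall in each bin.
pGood = nGood / sum(nGood);
pBad  = nBad  / sum(nBad);

% Weight of evidence per bin, then information value as the weighted sum.
woe = log(pGood ./ pBad);
iv  = sum((pGood - pBad) .* woe);
fprintf('Information value for the toy bins: %.4f\n', iv);
```

Each term in the sum is nonnegative, so information value grows as the good and bad distributions across bins move further apart.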

Visualize the thresholds on the metric values for each predictor using the local function `thresholdPlot`, defined at the end of this example.

`thresholdPlot(T, infovalueThresh, 'InfoValue')`

Select predictors based on their accuracy ratio.

```
arThresh = 0.08;
thresholdPlot(T, arThresh, 'AccuracyRatio')
```

### Screening Summary

Summarize the thresholding results in table form. The last column indicates which of the predictors passed both of the threshold tests and can be included in the final data set to create the credit scorecard. `summaryTable` and `displaySummaryTable` are local functions.

```
metrics = {'InfoValue', 'AccuracyRatio'};
thresholds = [infovalueThresh arThresh];
S = summaryTable(T, metrics, thresholds);
displaySummaryTable(S)
```
```
                          InfoValue    AccuracyRatio    PassedAll
                          _________    _____________    _________

    CustAge               ✔            ✔                ✔
    TmWBank               ✔            ✔                ✔
    CustIncome            ✔            ✔                ✔
    BalanceIncomeRatio    ✔            ✔                ✔
    TmAtAddress           ✔            ✘                ✘
    UtilRate              ✘            ✘                ✘
    AMBalance             ✘            ✔                ✘
    BalanceUtilRatio      ✘            ✘                ✘
    EmpStatus             ✘            ✔                ✘
    OtherCC               ✘            ✘                ✘
    ResStatus             ✘            ✘                ✘
```
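An alternative to fixed thresholds is to keep a fixed number of top-ranked predictors by a single metric, as the introduction mentions. The following is a minimal sketch using AUROC values copied by hand from `T`. The choice of AUROC and of keeping three predictors is arbitrary, and `nKeep`, `order`, and `topNames` are illustrative names.

```matlab
% Predictor names and AUROC entries copied from the metrics table.
names = {'CustAge'; 'TmWBank'; 'CustIncome'; 'BalanceIncomeRatio'; ...
         'TmAtAddress'; 'UtilRate'; 'AMBalance'; 'BalanceUtilRatio'; ...
         'EmpStatus'; 'OtherCC'; 'ResStatus'};
auroc = [0.5836; 0.56806; 0.58879; 0.5639; 0.50521; 0.51796; ...
         0.54357; 0.51327; 0.55443; 0.52223; 0.52493];

% Number of predictors to keep (arbitrary for this sketch).
nKeep = 3;

% Rank predictors from highest to lowest AUROC and keep the first nKeep.
[~, order] = sort(auroc, 'descend');
topNames = names(order(1:nKeep));
disp(topNames)
```

A rank-based cut always returns the same number of predictors, whereas a threshold can pass many or none depending on the data.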

### Reduce Table

Create a reduced data set that contains only the predictors that pass both threshold tests. A credit scorecard created from the reduced data set requires less memory.

```
% Get a list of all passing predictors.
predictor_list = T.Row;
top_predictors = predictor_list(S.PassedAll);

% Trim the data table to contain only the ID, passing predictors, and
% response.
top_predictor_table = data(:,[idvar; top_predictors; responsevar]);

% Create the credit scorecard using the screened predictors.
sc = creditscorecard(top_predictor_table,'IDVar',idvar,'ResponseVar',responsevar,...
    'BinMissingData', true)
```
```
sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x6 cell}
        NumericPredictors: {1x4 cell}
    CategoricalPredictors: {1x0 cell}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {1x4 cell}
                     Data: [1200x6 table]
```

### Local Functions

```
function passed = thresholdPredictor(T, threshold, metric)
% Threshold a predictor and return a logical vector to indicate passing
% predictors.

% Check which predictors pass the threshold.
switch metric
    case {'InfoValue', 'AccuracyRatio', 'AUROC'}
        % Larger values are better.
        passed = T.(metric) >= threshold;
    case {'Entropy', 'Gini', 'Chi2PValue', 'PercentMissing'}
        % Smaller values are better.
        passed = T.(metric) <= threshold;
    otherwise
        % Avoid returning an undefined output for an unknown metric name.
        error('Unsupported metric: %s', metric);
end
end

function thresholdPlot(T, threshold, metric)
% Plot bar charts to summarize predictor selection based on metric thresholds.

% Threshold the predictors.
passed = thresholdPredictor(T, threshold, metric);

% Get all predictors.
predictorNames = T.Row;
nPredictors = length(predictorNames);

% Create the bar charts: green for passing predictors, red for the rest.
f = figure;
ax = axes('parent',f);
bAR = bar(ax, 1:nPredictors, T.(metric), 'FaceColor', 'flat');
bAR.CData(passed,:) = repmat([0,1,0],sum(passed),1);
bAR.CData(~passed, :) = repmat([1,0,0],sum(~passed),1);
ax.TickLabelInterpreter = 'none';
xticks(ax, 1:nPredictors)
xticklabels(ax, predictorNames)
xtickangle(ax, 45)

% Scale the YLim. The variable is named yLimits so that it does not shadow
% the built-in ylim function.
delta = max(T.(metric)) - min(T.(metric));
d10 = 0.1 * delta;
yLimits = [min(T.(metric)) - d10 max(T.(metric)) + d10];
set(ax,'YLim',yLimits);

% Add threshold lines.
hold on
plot(xlim, [threshold threshold],'k--');
xlabel('Predictor')
ylabel(metric)
title(sprintf('Predictor Performance by %s',metric));
hold off
end

function S = summaryTable(T, metrics, thresholds)
% Create table summarizing all thresholds.
S = T;

% Remove metrics that are not thresholded.
unthresholded = setdiff(S.Properties.VariableNames, metrics);
S(:,unthresholded) = [];

% Show thresholding summary.
passed_all = true(numel(T.Row),1);
for i = 1:numel(metrics)
    metrici = metrics{i};
    thresholdi = thresholds(i);
    passed = thresholdPredictor(T, thresholdi, metrici);
    S.(metrici) = passed;
    passed_all = passed_all & passed;
end

% Add summary column.
S.PassedAll = passed_all;
end

function displaySummaryTable(S)
% Display a summary table with check marks for passed thresholds.
cols = S.Properties.VariableNames;

% Convert each column to check marks and X marks.
for i = 1:numel(cols)
    coli = cols{i};
    charvec = repmat(char(10008),size(S,1),1); % Initialize as 'X'.
    charvec(S.(coli)) = char(10004);           % Check if it passes the threshold.
    S.(coli) = charvec;
end
disp(S);
end
```
