This example describes the assignment of points for missing data when the 'BinMissingData'
option is set to true
.
Predictors that have missing data in the training set have an explicit bin for <missing>
with corresponding points in the final scorecard. These points are computed from the Weight-of-Evidence (WOE) value for the <missing>
bin and the logistic model coefficients. For scoring purposes, these points are assigned to missing values and to out-of-range values.
Predictors with no missing data in the training set have no <missing>
bin, therefore no WOE can be estimated from the training data. By default, the points for missing and out-of-range values are set to NaN
, and this leads to a score of NaN
when running score
. For predictors that have no explicit <missing>
bin, use the name-value argument 'Missing'
in formatpoints
to indicate how missing data should be treated for scoring purposes.
Create a creditscorecard
object using the CreditCardData.mat
file to load the dataMissing
with missing values.
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status
______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______
1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0
2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0
3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0
4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0
5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0
Number of missing values CustAge: 30
Number of missing values ResStatus: 40
Use creditscorecard
with the name-value argument 'BinMissingData'
set to true
to bin the missing numeric or categorical data in a separate bin. Apply automatic binning.
creditscorecard with properties:
GoodLabel: 0
ResponseVar: 'status'
WeightsVar: ''
VarNames: {'CustID' 'CustAge' 'TmAtAddress' 'ResStatus' 'EmpStatus' 'CustIncome' 'TmWBank' 'OtherCC' 'AMBalance' 'UtilRate' 'status'}
NumericPredictors: {'CustAge' 'TmAtAddress' 'CustIncome' 'TmWBank' 'AMBalance' 'UtilRate'}
CategoricalPredictors: {'ResStatus' 'EmpStatus' 'OtherCC'}
BinMissingData: 1
IDVar: 'CustID'
PredictorVars: {'CustAge' 'TmAtAddress' 'ResStatus' 'EmpStatus' 'CustIncome' 'TmWBank' 'OtherCC' 'AMBalance' 'UtilRate'}
Data: [1200x11 table]
Set a minimum value of zero for CustAge
and CustIncome
. With this, any negative age or income information becomes invalid or "out-of-range". For scoring purposes, out-of-range values are given the same points as missing values.
Display and plot bin information for numeric data for 'CustAge'
that includes missing data in a separate bin labelled <missing>
.
Bin Good Bad Odds WOE InfoValue
_____________ ____ ___ ______ ________ __________
{'[0,33)' } 69 52 1.3269 -0.42156 0.018993
{'[33,37)' } 63 45 1.4 -0.36795 0.012839
{'[37,40)' } 72 47 1.5319 -0.2779 0.0079824
{'[40,46)' } 172 89 1.9326 -0.04556 0.0004549
{'[46,48)' } 59 25 2.36 0.15424 0.0016199
{'[48,51)' } 99 41 2.4146 0.17713 0.0035449
{'[51,58)' } 157 62 2.5323 0.22469 0.0088407
{'[58,Inf]' } 93 25 3.72 0.60931 0.032198
{'<missing>'} 19 11 1.7273 -0.15787 0.00063885
{'Totals' } 803 397 2.0227 NaN 0.087112
Display and plot bin information for categorical data for 'ResStatus'
that includes missing data in a separate bin labelled <missing>
.
Bin Good Bad Odds WOE InfoValue
______________ ____ ___ ______ _________ __________
{'Tenant' } 296 161 1.8385 -0.095463 0.0035249
{'Home Owner'} 352 171 2.0585 0.017549 0.00013382
{'Other' } 128 52 2.4615 0.19637 0.0055808
{'<missing>' } 27 13 2.0769 0.026469 2.3248e-05
{'Totals' } 803 397 2.0227 NaN 0.0092627
For the 'CustAge'
and 'ResStatus'
predictors, there is missing data (NaN
s and <undefined>
) in the training data, and the binning process estimates a WOE value of -0.15787
and 0.026469
respectively for missing data in these predictors, as shown above.
For EmpStatus
and CustIncome
there is no explicit bin for missing values because the training data has no missing values for these predictors.
Bin Good Bad Odds WOE InfoValue
____________ ____ ___ ______ ________ _________
{'Unknown' } 396 239 1.6569 -0.19947 0.021715
{'Employed'} 407 158 2.5759 0.2418 0.026323
{'Totals' } 803 397 2.0227 NaN 0.048038
Bin Good Bad Odds WOE InfoValue
_________________ ____ ___ _______ _________ __________
{'[0,29000)' } 53 58 0.91379 -0.79457 0.06364
{'[29000,33000)'} 74 49 1.5102 -0.29217 0.0091366
{'[33000,35000)'} 68 36 1.8889 -0.06843 0.00041042
{'[35000,40000)'} 193 98 1.9694 -0.026696 0.00017359
{'[40000,42000)'} 68 34 2 -0.011271 1.0819e-05
{'[42000,47000)'} 164 66 2.4848 0.20579 0.0078175
{'[47000,Inf]' } 183 56 3.2679 0.47972 0.041657
{'Totals' } 803 397 2.0227 NaN 0.12285
Use fitmodel
to fit a logistic regression model using Weight of Evidence (WOE) data. fitmodel
internally transforms all the predictor variables into WOE values, using the bins found with the automatic binning process. fitmodel
then fits a logistic regression model using a stepwise method (by default). For predictors that have missing data, there is an explicit <missing>
bin, with a corresponding WOE value computed from the data. When using fitmodel
, the corresponding WOE value for the <missing>
bin is applied when performing the WOE transformation.
1. Adding CustIncome, Deviance = 1490.8527, Chi2Stat = 32.588614, PValue = 1.1387992e-08
2. Adding TmWBank, Deviance = 1467.1415, Chi2Stat = 23.711203, PValue = 1.1192909e-06
3. Adding AMBalance, Deviance = 1455.5715, Chi2Stat = 11.569967, PValue = 0.00067025601
4. Adding EmpStatus, Deviance = 1447.3451, Chi2Stat = 8.2264038, PValue = 0.0041285257
5. Adding CustAge, Deviance = 1442.8477, Chi2Stat = 4.4974731, PValue = 0.033944979
6. Adding ResStatus, Deviance = 1438.9783, Chi2Stat = 3.86941, PValue = 0.049173805
7. Adding OtherCC, Deviance = 1434.9751, Chi2Stat = 4.0031966, PValue = 0.045414057
Generalized linear regression model:
logit(status) ~ 1 + CustAge + ResStatus + EmpStatus + CustIncome + TmWBank + OtherCC + AMBalance
Distribution = Binomial
Estimated Coefficients:
Estimate SE tStat pValue
________ ________ ______ __________
(Intercept) 0.70229 0.063959 10.98 4.7498e-28
CustAge 0.57421 0.25708 2.2335 0.025513
ResStatus 1.3629 0.66952 2.0356 0.04179
EmpStatus 0.88373 0.2929 3.0172 0.002551
CustIncome 0.73535 0.2159 3.406 0.00065929
TmWBank 1.1065 0.23267 4.7556 1.9783e-06
OtherCC 1.0648 0.52826 2.0156 0.043841
AMBalance 1.0446 0.32197 3.2443 0.0011775
1200 observations, 1192 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 88.5, p-value = 2.55e-16
Scale the scorecard points by the "points, odds, and points to double the odds (PDO)" method using the 'PointsOddsAndPDO'
argument of formatpoints
. Suppose that you want a score of 500 points to have odds of 2 (twice as likely to be good than to be bad) and that the odds double every 50 points (so that 550 points would have odds of 4).
Display the scorecard showing the scaled points for predictors retained in the fitting model.
PointsInfo=38×3 table
Predictors Bin Points
_____________ ______________ ______
{'CustAge' } {'[0,33)' } 54.062
{'CustAge' } {'[33,37)' } 56.282
{'CustAge' } {'[37,40)' } 60.012
{'CustAge' } {'[40,46)' } 69.636
{'CustAge' } {'[46,48)' } 77.912
{'CustAge' } {'[48,51)' } 78.86
{'CustAge' } {'[51,58)' } 80.83
{'CustAge' } {'[58,Inf]' } 96.76
{'CustAge' } {'<missing>' } 64.984
{'ResStatus'} {'Tenant' } 62.138
{'ResStatus'} {'Home Owner'} 73.248
{'ResStatus'} {'Other' } 90.828
{'ResStatus'} {'<missing>' } 74.125
{'EmpStatus'} {'Unknown' } 58.807
{'EmpStatus'} {'Employed' } 86.937
{'EmpStatus'} {'<missing>' } NaN
⋮
Notice that points for the <missing>
bin for CustAge
and ResStatus
are explicitly shown (as 64.9836
and 74.1250
, respectively). These points are computed from the WOE value for the <missing>
bin, and the logistic model coefficients.
For predictors that have no missing data in the training set, there is no explicit <missing>
bin. By default the points are set to NaN
for missing data and they lead to a score of NaN
when running score
. For predictors that have no explicit <missing>
bin, use the name-value argument 'Missing'
in formatpoints
to indicate how missing data should be treated for scoring purposes.
For the purpose of illustration, take a few rows from the original data as test data and introduce some missing data. Also introduce some invalid, or out-of-range values. For numeric data, values below the minimum (or above the maximum) allowed are considered invalid, such as a negative value for age (recall 'MinValue'
was earlier set to 0
for CustAge
and CustIncome
). For categorical data, invalid values are categories not explicitly included in the scorecard, for example, a residential status not previously mapped to scorecard categories, such as "House", or a meaningless string such as "abc123".
CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance
_______ ___________ ___________ __________ _______ _______ _________
NaN Tenant Unknown 34000 44 Yes 119.8
48 <undefined> Unknown 44000 14 Yes 403.62
65 Home Owner <undefined> 48000 6 No 111.88
44 Other Unknown NaN 35 No 436.41
-100 Other Employed 46000 16 Yes 162.21
33 House Employed 36000 36 Yes 845.02
39 Tenant Freelancer 34000 40 Yes 756.26
24 Home Owner Employed -1 19 Yes 449.61
Score the new data and see how points are assigned for missing CustAge
and ResStatus
, because we have an explicit bin with points for <missing>
. However, for EmpStatus
and CustIncome
the score
function sets the points to NaN
.
481.2231
520.8353
NaN
NaN
551.7922
487.9588
NaN
NaN
CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance
_______ _________ _________ __________ _______ _______ _________
64.984 62.138 58.807 67.893 61.858 75.622 89.922
78.86 74.125 58.807 82.439 61.061 75.622 89.922
96.76 73.248 NaN 96.969 51.132 50.914 89.922
69.636 90.828 58.807 NaN 61.858 50.914 89.922
64.984 90.828 86.937 82.439 61.061 75.622 89.922
56.282 74.125 86.937 70.107 61.858 75.622 63.028
60.012 62.138 NaN 67.893 61.858 75.622 63.028
54.062 73.248 86.937 NaN 61.061 75.622 89.922
Use the name-value argument 'Missing'
in formatpoints
to choose how to assign points to missing values for predictors that do not have an explicit <missing>
bin. In this example, use the 'MinPoints'
option for the 'Missing'
argument. The minimum points for EmpStatus
in the scorecard displayed above are 58.8072
, and for CustIncome
the minimum points are 29.3753
.
481.2231
520.8353
517.7532
451.3405
551.7922
487.9588
449.3577
470.2267
CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance
_______ _________ _________ __________ _______ _______ _________
64.984 62.138 58.807 67.893 61.858 75.622 89.922
78.86 74.125 58.807 82.439 61.061 75.622 89.922
96.76 73.248 58.807 96.969 51.132 50.914 89.922
69.636 90.828 58.807 29.375 61.858 50.914 89.922
64.984 90.828 86.937 82.439 61.061 75.622 89.922
56.282 74.125 86.937 70.107 61.858 75.622 63.028
60.012 62.138 58.807 67.893 61.858 75.622 63.028
54.062 73.248 86.937 29.375 61.061 75.622 89.922