What is the reference category in the output for a Fitlme with categorical variables and three-way interaction terms?

Question

Robert Joniec, 10 May 2020


https://jp.mathworks.com/matlabcentral/answers/524368-what-is-the-reference-category-in-the-output-for-a-fitlme-with-categorical-variables-and-three-way-i

Commented: Peng Li, 28 May 2020

The table below summarizes the output of a linear mixed-effects model with a random intercept and slope, fitted to structured panel data ('tbl_early'). The model is specified as:

lme_PrimaryHU = fitlme(tbl_early, 'logRoL ~ 1 + logLoL + logAnnLioL + Dur + PPI + AvgEffTax_1 + HU + logEP*logAP + EQ*PrimaryHU*relInsLoss_1 + Wstorm*relInsLoss_1 +Storm*PrimaryHU*relInsLoss_1 +(FFR|ID)')

'Dur' has 4 levels and therefore I understood that the output shows three levels with estimates that relate to the fourth, i.e. the reference level ('Dur_one'). From the results one could interpret that Dur_onehalf trades at a discount if compared to Dur_one, all else equal.

'HU', 'Storm', 'EQ' and 'Wstorm' are binary variables. They are not mutually exclusive (cross-sectional analysis), and there is no case in the data in which all of them are 0. The question is therefore which of these variables MATLAB chose as the reference case. Note that some of the peril variables are also used in the two- or three-way interaction terms that appear further down in the table. 'PrimaryHU' is a binary variable that controls for a certain condition which affects the potential effects of 'relInsLoss' or of 'HU', 'Storm', 'EQ' and 'Wstorm' (e.g. 'EQ' alone is positive but not significant at p<0.1, 'EQ*relInsLoss' is negative and still not significant, 'EQ*PrimaryHU' is negative and significant, and 'EQ*relInsLoss*PrimaryHU' is positive and significant). All remaining variables are continuous.

Two-way interactions used:

'logAP*logEP'
'Wstorm*relInsLoss'

Three-way interactions used:

'Storm*PrimaryHU*relInsLoss'
'EQ*PrimaryHU*relInsLoss'

Any other interaction terms, and the separate estimates for the underlying variables, should be a by-product of using the interaction terms above.

Many thanks for any help in advance!

Rob


Answer 1

Peng Li, 10 May 2020


The table you copied isn't MATLAB's default display, so it's difficult to tell anything from it. It looks like ANOVA output, since each categorical item (including interaction items) corresponds to only one line.

As you mentioned, for categorical variables the regression output states explicitly which level each row is for, and the level without an output row is the reference level. A dichotomous variable is just a special case of a categorical variable. For example, if you have sex (0/1), the output usually shows a row such as sex[1] xx, xx, xx, xx..., which means that 0 is used as the reference. The same strategy is used to display interaction items that involve categorical variables.

You also have to make them categorical explicitly, e.g. with tbl.sex = categorical(tbl.sex); otherwise the variable is treated as continuous by default, and 0 is then always the implicit reference value.
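As a minimal sketch of the point above (synthetic data with made-up values; only the variable names HU, Dur and ID are borrowed from the question), the following shows how to check which level fitlme will treat as the reference and how to change it. It assumes the Statistics and Machine Learning Toolbox and relies on fitlme's default 'reference' dummy coding, under which the first category listed by categories(...) gets no coefficient row.

```matlab
% Synthetic stand-in data: names mirror the thread, values are random.
rng(1);
n   = 200;
tbl = table();
tbl.ID  = categorical(randi(20, n, 1));
tbl.HU  = categorical(randi([0 1], n, 1));   % binary flag with categories '0' and '1'
tbl.Dur = categorical(randi(4, n, 1), 1:4, {'one','half','onehalf','three'});
tbl.y   = randn(n, 1);

% The first category listed is the one used as the reference level.
disp(categories(tbl.Dur));   % 'one' is listed first

% To use a different reference level, move it to the front before fitting.
tbl.Dur = reordercats(tbl.Dur, {'three','one','half','onehalf'});

lme = fitlme(tbl, 'y ~ 1 + HU + Dur + (1|ID)');

% 'Dur_three' has no row now; 'HU_1' is reported relative to HU = '0'.
disp(lme.CoefficientNames');
```

With this coding, the intercept is the model's fixed-effects prediction when every categorical predictor sits at its first category and every continuous predictor is 0, whether or not such a row actually exists in the data.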

In the formula you used, FFR doesn't appear as a fixed effect. If you only want a subject-specific intercept, use (1|ID); otherwise, make sure that this is really what you want.
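To illustrate the distinction, here is a small sketch on synthetic data (all numbers are made up; only the names FFR and ID come from the question). It fits a random-intercept model and a random-intercept-and-slope model, with FFR also entered as a fixed effect so that the random slopes are deviations from an average FFR slope rather than from zero.

```matlab
% Synthetic panel: 30 subjects, 10 observations each (values are hypothetical).
rng(3);
nID  = 30;
nObs = 10;
ID   = repelem((1:nID)', nObs);
FFR  = repmat(linspace(0, 5, nObs)', nID, 1);
u0   = 0.6 * randn(nID, 1);   % subject-specific intercept shifts
u1   = 0.1 * randn(nID, 1);   % subject-specific slope shifts
y    = 1 + u0(ID) + (0.2 + u1(ID)) .* FFR + 0.3 * randn(nID * nObs, 1);
tbl  = table(categorical(ID), FFR, y, 'VariableNames', {'ID', 'FFR', 'y'});

% Random intercept only.
lmeIntercept = fitlme(tbl, 'y ~ 1 + FFR + (1|ID)');

% Random intercept and random slope for FFR, plus FFR as a fixed effect.
lmeSlope = fitlme(tbl, 'y ~ 1 + FFR + (1 + FFR|ID)');

disp(lmeSlope.CoefficientNames);   % fixed effects: '(Intercept)' and 'FFR'
```

Whether the fixed FFR term belongs in the model is a modelling decision; the sketch only shows how the two specifications are written.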


Robert Joniec, 11 May 2020

Edited: Robert Joniec, 11 May 2020

Please see below the full output as reported by MATLAB. You will notice that the estimates are slightly different (shifted); this is due to a minor adjustment in the data. Nothing else has changed.

To use the interaction terms I already had to make sure the variables are categorical, using the command you referenced. I just double-checked the table (tbl_early) and they are 'categorical'.

FFR and ID are meant to be the random slope and intercept; I hope this is what they actually are.

Nevertheless, good points and open for any further suggestions!

Model information:
    Number of observations            1672
    Fixed effects coefficients          25
    Random effects coefficients        248
    Covariance parameters                4
 
Formula:
    Linear Mixed Formula with 15 predictors.
 
Model fit statistics:
    AIC       BIC       LogLikelihood    Deviance
    1093.5    1250.7    -517.73          1035.5 
 
Fixed effects coefficients (95% CIs):
    Name                                      Estimate     SE           tStat       DF      pValue        Lower        Upper     
    '(Intercept)'                               -1.8845      0.31006     -6.0778    1647    1.5106e-09      -2.4926        -1.2763
    'Dur_half'                                  0.28114      0.19358      1.4523    1647       0.14661     -0.09855        0.66082
    'Dur_onehalf'                              -0.16074     0.072323     -2.2225    1647      0.026386     -0.30259      -0.018882
    'Dur_three'                               -0.055263       0.1671    -0.33071    1647        0.7409     -0.38302        0.27249
    'HU_1'                                    -0.075889     0.038204     -1.9864    1647       0.04715     -0.15082    -0.00095628
    'Storm_1'                                    -0.217     0.044114      -4.919    1647    9.5672e-07     -0.30352       -0.13047
    'EQ_1'                                     0.061545     0.044709      1.3766    1647       0.16883    -0.026146        0.14924
    'Wstorm_1'                                 -0.14178     0.047647     -2.9756    1647     0.0029669     -0.23524      -0.048323
    'logAP'                                     0.33387     0.033934      9.8388    1647    3.0976e-22      0.26731        0.40043
    'logEP'                                     0.01018     0.027846     0.36559    1647       0.71472    -0.044437       0.064797
    'PrimaryHU_1'                              0.020094     0.047664     0.42159    1647       0.67338    -0.073393        0.11358
    'PPI'                                      0.011507    0.0025507      4.5114    1647    6.8953e-06    0.0065044        0.01651
    'AvgEffTax_1'                               0.26183     0.096878      2.7027    1647     0.0069484     0.071816        0.45185
    'logLoL'                                    0.38062     0.021995      17.305    1647    8.9533e-62      0.33748        0.42376
    'logAnnLioL'                                0.23723      0.01393      17.031    1647    4.9212e-60      0.20991        0.26455
    'relInsLoss_1'                             -0.97689      0.62024      -1.575    1647       0.11544      -2.1934        0.23964
    'logAP:logEP'                              0.026845    0.0037648      7.1306    1647    1.4901e-12     0.019461        0.03423
    'Storm_1:PrimaryHU_1'                       0.17007     0.067152      2.5327    1647      0.011413     0.038361        0.30179
    'EQ_1:PrimaryHU_1'                         -0.23205     0.070344     -3.2987    1647    0.00099188     -0.37002      -0.094074
    'Storm_1:relInsLoss_1'                      0.88003      0.76047      1.1572    1647       0.24735     -0.61156         2.3716
    'EQ_1:relInsLoss_1'                         -1.1103      0.79964     -1.3885    1647       0.16518      -2.6787        0.45815
    'Wstorm_1:relInsLoss_1'                      0.7301      0.38568       1.893    1647      0.058529    -0.026369         1.4866
    'PrimaryHU_1:relInsLoss_1'                   1.0931      0.65186      1.6769    1647      0.093761     -0.18549         2.3717
    'Storm_1:PrimaryHU_1:relInsLoss_1'          -1.6867      0.79209     -2.1295    1647      0.033364      -3.2403       -0.13311
    'EQ_1:PrimaryHU_1:relInsLoss_1'              1.8402      0.83281      2.2096    1647      0.027272      0.20667         3.4736
 
Random effects covariance parameters (95% CIs):
Group: ID (124 Levels)
    Name1                Name2                Type          Estimate    Lower       Upper 
    '(Intercept)'        '(Intercept)'        'std'         0.59814      0.48675    0.73502
    'FFR'                '(Intercept)'        'corr'        0.74869      0.57692    0.85704
    'FFR'                'FFR'                'std'         0.07566     0.056887    0.10063
 
Group: Error
    Name             Estimate    Lower      Upper 
    'Res Std'        0.28828     0.27801    0.29892

Robert Joniec, 13 May 2020

Hi Peng,

please excuse me if I repeat myself, but it is not clear to me yet. I have included a simpler case which may be more appropriate for our discussion (please see below). Here the multi-level categorical variable Dur is replaced by a similar 4-level variable ReinsPrice (A_@100, reference; A_@50; A_Free; A_none).

Again, HU, Storm and EQ are not mutually exclusive, which is why I would prefer not to transform the data into a single nominal variable (whose levels would be all the possible combinations: HU, HU_Storm, HU_EQ, HU_Storm_EQ, Storm, Storm_EQ, EQ), because this affects how I can combine the dichotomous variables in the three-way interaction terms mentioned initially. As far as I remember, my earlier use of a nominal variable produced a design matrix that was not of full rank. Second, if I understood correctly, the output above and below implies that the reference contract is one where HU=0, Storm=0 and EQ=0 (as opposed to =1).

Now, how is the output to be interpreted if there is no case in the data where these variables are all equal to zero? What is the reference case that I would need to refer to?

Thanks & best

Rob

Linear mixed-effects model fit by ML
Model information:
    Number of observations            1584
    Fixed effects coefficients          13
    Random effects coefficients        328
    Covariance parameters                4
Formula:
    logRoL ~ 1 + ReinsPrice + HU + Storm + EQ + AvgEffTax_1 + logLoL + logAnnLioL + logAP*logEP + (1 + FFR | ID)
Model fit statistics:
    AIC       BIC       LogLikelihood    Deviance
    674.32    765.57    -320.16          640.32  
Fixed effects coefficients (95% CIs):
    Name                      Estimate      SE          tStat      DF      pValue        Lower        Upper    
    '(Intercept)'               -0.49915     0.18673    -2.6731    1571     0.0075931     -0.86541     -0.13288
    'ReinsPrice_A_@50'           0.15053     0.20984    0.71735    1571       0.47326     -0.26107      0.56213
    'ReinsPrice_FREE'           -0.15159    0.045847    -3.3065    1571    0.00096599     -0.24152    -0.061666
    'ReinsPrice_none'            0.20374     0.08233     2.4747    1571      0.013441     0.042252      0.36523
    'HU_1'                        0.1472      0.0353     4.1699    1571    3.2139e-05     0.077957      0.21644
    'Storm_1'                   -0.14842    0.041395    -3.5856    1571    0.00034664     -0.22962    -0.067229
    'EQ_1'                      -0.14224    0.045327     -3.138    1571     0.0017326     -0.23114    -0.053329
    'logAP'                      0.20518     0.06183     3.3185    1571    0.00092572     0.083905      0.32646
    'logEP'                   -0.0017517    0.044235    -0.0396    1571       0.96842    -0.088518     0.085015
    'AvgEffTax_1'               -0.39303    0.099428    -3.9529    1571    8.0646e-05     -0.58805       -0.198
    'logLoL'                     0.83301    0.086883     9.5878    1571    3.3793e-21       0.6626       1.0034
    'logAnnLioL'                 0.13134    0.011369     11.552    1571    1.0946e-29      0.10904      0.15364
    'logAP:logEP'               0.051937     0.00492     10.556    1571    3.2424e-25     0.042287     0.061588
Random effects covariance parameters (95% CIs):
Group: ID (164 Levels)
    Name1                Name2                Type          Estimate    Lower      Upper  
    '(Intercept)'        '(Intercept)'        'std'          0.60045    0.47624    0.75707
    'FFR'                '(Intercept)'        'corr'         0.82464    0.70156    0.89994
    'FFR'                'FFR'                'std'         0.092079    0.06979    0.12149
Group: Error
    Name             Estimate    Lower      Upper  
    'Res Std'        0.25382     0.24429    0.26373

Robert Joniec, 14 May 2020

Peng, thank you so much for your patience so far. Please excuse me for going in again...

My question is regarding the "holding other variables constant" - constant at which level? If HU_1 is the effect that is associated with including HU_1 (for example, as one of three contract features, where at least one has to be chosen and they can be combined), does this mean that the estimate of 0.1472 is for the case in which Storm_1 and EQ_1 are both excluded (say =0)? This would mean that including HU makes the contract more expensive in comparison to Storm and EQ.

The question of how the contract features affect pricing makes a big difference in the interpretation of the interaction terms in the model output posted on May 11. There we have interaction terms like Storm_1:PrimaryHU_1:relInsLoss_1. Unfortunately, the name "relInsLoss_1" (a continuous variable!) is misleading: I chose this name not knowing that MATLAB would append "_1" to the dichotomous variables. Anyway, would the estimate of -1.6867 be for the case in which Storm = 1 and PrimaryHU = 1? Since PrimaryHU is a condition that does not always occur, the interpretation would be something like: a contract that includes Storm is less expensive in years in which PrimaryHU occurs. Is the "less expensive" relative to the occurrence of PrimaryHU, or relative to HU, EQ and Wstorm?

It gives me such a hard time because it does not make sense that the event of PrimaryHU triggers a discount for all contracts that have Storm as a feature. It would only make sense if the pricing difference is relative to other contracts. Then the interpretation would be that prices of such reference contracts increase, and contracts with a Storm feature trade at a discount relative to that increase (they do not follow the increase).

All the best

Rob

Peng Li, 15 May 2020

Hi Rob,

You are welcome! See some further clarifications:

> My question is regarding the "holding other variables constant" - constant at which level? If HU_1 is the effect that is associated with including HU_1 (for example, as one of three contract features, where at least one has to be chosen and they can be combined), does this mean that the estimate of 0.1472 is for the case in which Storm_1 and EQ_1 are both excluded (say =0)? This would mean that including HU makes the contract more expensive in comparison to Storm and EQ.

Since you don't have an interaction item involving either of these parameters, it doesn't matter at which level you hold the other variables constant. Here is an easier case:

y1 = int + b1*HU + b2*Storm

If you change Storm from 0 to 1, the only difference is that the regression line (y1~HU) moves up or down (depending on the sign of b2) without any change in the slope (b1). In other words, the model assumes that the effect of HU on y1 is not modified by Storm (which may not be true in reality; without an interaction term the model simply cannot show it). The level of Storm only changes the intercept of the y1~HU relationship. Another example:

y2 = int + b1*HU + b2*Storm + b12*HU*Storm

Including HU*Storm makes it possible to test whether the effect of either one is modified by the other. For example, if you hold Storm at 0, the effect of HU on y2 is b1 (y2 = int + b1*HU + 0 + 0). If you hold Storm at 1, the effect of HU on y2 is b1+b12 (y2 = int + b1*HU + b2 + b12*HU; again, this involves a change in the intercept as well).
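The y2 example can be checked numerically. The coefficients below are hypothetical numbers chosen only for illustration (they are not estimates from the thread); the sketch evaluates the effect of switching HU from 0 to 1 at each level of Storm.

```matlab
% Hypothetical coefficients for y2 = int + b1*HU + b2*Storm + b12*HU*Storm.
int = 1.0;
b1  = 0.5;
b2  = -0.3;
b12 = 0.2;

y2 = @(HU, Storm) int + b1*HU + b2*Storm + b12*HU.*Storm;

% Effect of HU (0 -> 1) with Storm held at 0: equals b1.
effectHU_Storm0 = y2(1, 0) - y2(0, 0);

% Effect of HU (0 -> 1) with Storm held at 1: equals b1 + b12.
effectHU_Storm1 = y2(1, 1) - y2(0, 1);

fprintf('Storm = 0: %.2f, Storm = 1: %.2f\n', effectHU_Storm0, effectHU_Storm1);
```

Setting b12 to 0 reproduces the y1 case, where both differences equal b1.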

Does this make sense?

> The question of how the contract features affect pricing makes a big difference in the interpretation of the interaction terms in the model output posted on May 11. There we have interaction terms like Storm_1:PrimaryHU_1:relInsLoss_1. Unfortunately, the name "relInsLoss_1" (a continuous variable!) is misleading: I chose this name not knowing that MATLAB would append "_1" to the dichotomous variables. Anyway, would the estimate of -1.6867 be for the case in which Storm = 1 and PrimaryHU = 1? Since PrimaryHU is a condition that does not always occur, the interpretation would be something like: a contract that includes Storm is less expensive in years in which PrimaryHU occurs. Is the "less expensive" relative to the occurrence of PrimaryHU, or relative to HU, EQ and Wstorm?

It is always tricky to interpret interaction items involving more than two predictors. I prefer not to include them unless I have a hypothesis that the variables really behave in such a complicated way.

Back to this question, let's again put it into a simple example:

y3 = int + b1*Storm + b2*PrimaryHU + b3*relInsLoss + b13*Storm*relInsLoss + b23*PrimaryHU*relInsLoss + b123*Storm*PrimaryHU*relInsLoss

The interaction item generally indicates how the effect of one predictor is modified by the other(s).

So if you are interested in the effect of relInsLoss: if Storm=0 and PrimaryHU=0, its effect is b3 (a 1-unit increase in relInsLoss yields a b3 increase in y3; this is a decrease if b3 is negative). If you then want to know how this effect changes if PrimaryHU=1, it is b3+b23 (b123 is not involved, as Storm is still 0). And again, if you want to know the effect if both Storm=1 and PrimaryHU=1, it is b3+b13+b23+b123. Whether this is a negative or a positive association depends on the signs and absolute values of the coefficients.

Again, these effects hold regardless of the other covariates, as long as those are not involved in interactions with these variables (including other covariates again only changes the intercept). If they are, put them into the equation and examine them in a similar way.

Does this make sense now?

> It gives me such a hard time because it does not make sense that the event of PrimaryHU triggers a discount for all contracts that have Storm as a feature. It would only make sense if the pricing difference is relative to other contracts. Then the interpretation would be that prices of such reference contracts increase, and contracts with a Storm feature trade at a discount relative to that increase (they do not follow the increase).

This depends on your hypothesis. If you think the effect of one predictor should not be modified by the other, then you shouldn't include that interaction item. Math doesn't know the hypothesis in your mind; it just processes whatever you feed it. The output, significant or not, is just the output of this math; it can be correct, it can be incorrect. It is always the people who know the data best!

What do you think?

Peng

Robert Joniec, 15 May 2020

Edited: Robert Joniec, 15 May 2020

Peng,

very helpful and highly appreciated. The brief back-to-basics was on point. There are just some side-considerations left for now.

Taking up your example of:

y3 = int + b1*Storm + b2*PrimaryHU + b3*relInsLoss + b13*Storm*relInsLoss + b23*PrimaryHU*relInsLoss + b123*Storm*PrimaryHU*relInsLoss

&

... And again, if you want to know the effect if both Storm=1 and PrimaryHU=1, it is b3+b13+b23+b123...

Based on the output posted on 11th May the relevant estimates would be (p-values):

b3, relInsLoss: -0.97689 (0.11544)
b13, Storm*relInsLoss: 0.88003 (0.24735)
b23, PrimaryHU*relInsLoss: 1.0931 (0.093761)
b123, Storm*PrimaryHU*relInsLoss: -1.6867 (0.033364)

This is what I get by including the term "Storm*PrimaryHU*relInsLoss_1" in the model. To calculate the estimate as you did in your example with Storm=1 and PrimaryHU=1, would you also include the estimates that are not significant at a chosen level (e.g. 0.1)?

y3= int + ... + -0.97689 + 0.88003 + 1.0931 + -1.6867 +...

Does this mean that this case-specific addition is legitimate for the three-way interaction, but would not be for the two-way interaction?

y3= int + ... + -0.97689 + 0.88003 +...

I will need to rethink how the three-way interaction matches the hypothesis. In the meantime I have doublechecked if there are any data issues and it seems that everything is fine.

Many thanks man!

Rob

Peng Li, 15 May 2020

Hi Rob,

To be clear, we should add the predictor in the equation

y3 = int + ... + (-0.97689 + 0.88003 + 1.0931 + -1.6867)*relInsLoss_1 + ...
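Plugging the four estimates quoted above into that expression gives the slope of relInsLoss_1 for each combination of Storm and PrimaryHU. This is plain arithmetic on the posted estimates, not a new model fit; the helper name slope is made up for the illustration.

```matlab
% Estimates taken from the output posted on 11 May.
b3   = -0.97689;   % relInsLoss_1
b13  =  0.88003;   % Storm_1:relInsLoss_1
b23  =  1.0931;    % PrimaryHU_1:relInsLoss_1
b123 = -1.6867;    % Storm_1:PrimaryHU_1:relInsLoss_1

% Slope of relInsLoss_1 as a function of the two binary indicators (0 or 1).
slope = @(Storm, PrimaryHU) b3 + b13*Storm + b23*PrimaryHU + b123*Storm*PrimaryHU;

fprintf('Storm=0, PrimaryHU=0: %.5f\n', slope(0, 0));   % -0.97689
fprintf('Storm=1, PrimaryHU=0: %.5f\n', slope(1, 0));   % -0.09686
fprintf('Storm=0, PrimaryHU=1: %.5f\n', slope(0, 1));   %  0.11621
fprintf('Storm=1, PrimaryHU=1: %.5f\n', slope(1, 1));   % -0.69046
```

The full model also contains a Storm_1:PrimaryHU_1 term, but that term does not multiply relInsLoss_1, so it shifts the intercept and leaves these slopes unchanged.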

This is a bit tricky, as the stats should come after the hypothesis. If you do believe that the variables behave in such a complicated way, you could include the three-way interaction. And given that you have a bunch of other covariates as well, you should consider whether you have enough power to take them all into account (this goes back to your sample size and the expected/estimated effect size).

Specifically, in your case both two-way interaction items are not significant. I'd rather not loosen the criterion by, say, raising the alpha level to 0.1. It's quite a tricky thing. People sometimes do this when they lack power because of a small sample size or a small effect size. I don't have a black-and-white answer for this. There has already been a heated discussion about the 0.05 alpha level in recent years.

For a nonsignificant interaction item, what people also do is drop it afterwards (post hoc). You can drop it and see whether the effect size changes dramatically. It is again tricky to calculate the actual effect size for relInsLoss_1: if you take both two-way interaction items into account, b123 is almost cancelled out by b13 and b23, and the effect will be quite different if you leave them out. But in any case it's based on the equation, and I do believe you should take all terms into account. In this case, the effect of relInsLoss_1 doesn't change too much (depending on the scale!) if you change both Storm and PrimaryHU to 1. In addition, you could try dropping the two two-way interactions and testing again. The three-way interaction might then be gone (not significant), and the conclusion would be that the effect of relInsLoss doesn't depend on either Storm or PrimaryHU.
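One way to judge the combined slope as a whole, rather than term by term, is to test the linear combination b3+b13+b23+b123 directly. The sketch below does this with coefTest on a synthetic data set (random numbers, so the result itself is meaningless). The variable names mimic the thread, and building the contrast by name matching is an assumption that works here only because relInsLoss appears in exactly these four terms.

```matlab
% Synthetic stand-in data; the response is pure noise.
rng(2);
n   = 400;
tbl = table();
tbl.ID         = categorical(randi(40, n, 1));
tbl.Storm      = categorical(randi([0 1], n, 1));
tbl.PrimaryHU  = categorical(randi([0 1], n, 1));
tbl.relInsLoss = rand(n, 1);
tbl.y          = randn(n, 1);

lme = fitlme(tbl, 'y ~ Storm*PrimaryHU*relInsLoss + (1|ID)');

% Contrast row that picks every coefficient involving relInsLoss:
% the main effect, both two-way terms and the three-way term.
names = lme.CoefficientNames(:)';
H     = double(contains(names, 'relInsLoss'));
assert(sum(H) == 4, 'Expected exactly four relInsLoss terms.');

combinedSlope = H * fixedEffects(lme);   % slope at Storm = 1, PrimaryHU = 1
pValue        = coefTest(lme, H);        % tests H*beta = 0

fprintf('combined slope = %.4f, p = %.4f\n', combinedSlope, pValue);
```

A large pValue would mean the data do not rule out a zero slope for that combination, irrespective of which individual terms are significant on their own.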

Hope this helps!

--Peng

Robert Joniec, 19 May 2020

Hi Peng,

sorry for the silence! Your last answer helps to complete the picture. Thank you! Regarding the alpha level, I have used 0.1 to decide whether variables stay in the final model. I think what you refer to is well described in Wasserstein, Schirm, & Lazar (2019).

The sample size is 1500-ish; however, the model is quite complex and the results are somewhat sensitive to changes in the model specification. The relationships between our variables are indeed not trivial, and this has been one of the reasons why we decided that the interaction terms are needed. The remaining sensitivity is (partially) due to what lies behind the data. (It covers 14 points in time, so the annual 'Storm' sub-sample sometimes gets as small as 50, while a good portion of those 50 would then also include 'HU' and 'EQ' in all possible combinations...) Bootstrapping standard errors proved to give limited insight, as the design matrix of the subsamples is often not of full rank if the interaction terms are included (due to the underlying information, of course).

In your last point you mentioned that it is possible to drop the two-way interactions and test again. How would you actually do it? When I use the term 'Storm*PrimaryHU*relInsLoss_1', MATLAB automatically includes the inherent two-way interaction terms and single variables, so it sounds like a specific exclusion would need to be added to the code?

All the best - Rob

Peng Li, 28 May 2020

Hi Rob,

Sorry that I overlooked this thread. To reply to your last question: you can explicitly drop an interaction item from your formula by subtracting it with the colon operator, e.g. "- Storm:PrimaryHU:relInsLoss_1", or you can add each item separately. Again, to make this simpler:

y ~ x1*x2
y ~ x1 + x2 + x1:x2

These two are identical; both include the interaction between x1 and x2.

y ~ x1 + x2
y ~ x1*x2 - x1:x2

These two are identical; both omit the interaction item.

y ~ x1*x2*x3 - x1:x2:x3

means all main effects plus the two-way interactions between each pair. The three-way interaction item is dropped by using - x1:x2:x3.
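The formula variants above can be tried on a toy data set. The sketch below uses random data and hypothetical variable names (x1, x2, x3, g); it fits the model with and without the three-way term and compares them with a likelihood ratio test via compare, which expects the smaller (nested) model first.

```matlab
% Random toy data; names and values are made up for illustration.
rng(4);
n   = 300;
tbl = table();
tbl.g  = categorical(randi(30, n, 1));
tbl.x1 = categorical(randi([0 1], n, 1));
tbl.x2 = categorical(randi([0 1], n, 1));
tbl.x3 = rand(n, 1);
tbl.y  = randn(n, 1);

% Full factorial: main effects, all two-way terms and the three-way term.
full = fitlme(tbl, 'y ~ x1*x2*x3 + (1|g)');

% Same model with the three-way term removed.
no3way = fitlme(tbl, 'y ~ x1*x2*x3 - x1:x2:x3 + (1|g)');

fprintf('full: %d coefficients, no3way: %d coefficients\n', ...
    numel(full.CoefficientNames), numel(no3way.CoefficientNames));

% Likelihood ratio test of the three-way term (both models are fit by ML,
% the fitlme default).
results = compare(no3way, full);
disp(results);
```

Two-way terms can be removed in the same way, for example by appending "- x1:x3 - x2:x3" to the formula, although keeping a three-way term without its two-way terms makes the model harder to interpret.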

Check this

https://www.mathworks.com/help/stats/fitlme.html#btyaa5y-formula go to More About --> formula

Hope this helps!

--Peng
