一般化線形モデルのワークフロー

ライブスクリプトを開く

この例では、一般化線形モデルの当てはめ方法と結果の解析方法について説明します。典型的なワークフローには、次のような手順が含まれます。データのインポート、一般化線形モデルの当てはめ、品質テスト、品質改善のためのモデルの変更、モデルに基づく予測。この例では、フィッシャーのアヤメのデータを使用して、花が 2 つのクラスのいずれかである確率を計算します。

データの読み込み

フィッシャーのアヤメのデータを読み込みます。

load fisheriris

51 行目から 150 行目を抽出します。これは versicolor または virginica の分類をもちます。

X = meas(51:end,:);

versicolor の場合に true、virginica の場合に false となる論理応答変数を作成します。

y = strcmp('versicolor',species(51:end));

一般化線形モデルの当てはめ

二項一般化線形モデルをデータに当てはめます。

mdl = fitglm(X,y,'linear','Distribution','binomial')

mdl = 
Generalized linear regression model:
    logit(y) ~ 1 + x1 + x2 + x3 + x4
    Distribution = Binomial

Estimated Coefficients:
                   Estimate      SE       tStat      pValue 
                   ________    ______    _______    ________

    (Intercept)     42.638     25.708     1.6586    0.097204
    x1              2.4652     2.3943     1.0296     0.30319
    x2              6.6809     4.4796     1.4914     0.13585
    x3             -9.4294     4.7372    -1.9905    0.046537
    x4             -18.286     9.7426    -1.8769    0.060529


100 observations, 95 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 127, p-value = 1.95e-26

モデル表示によると、pValue 列には小さくない p 値があります。これは、モデルが単純化できることを示しています。

モデルの調査と改善

係数の 95% の信頼区間に 0 が含まれているかどうかを確認します。該当する場合、それらの区間でモデルの項を削除できます。

confint = coefCI(mdl)

confint = 5×2

   -8.3984   93.6740
   -2.2881    7.2185
   -2.2122   15.5739
  -18.8339   -0.0248
  -37.6277    1.0554

4 番目の予測子 x3 のみが、信頼区間に 0 を含まない係数をもっています。

x1 および x2 の係数には大きな p 値があり、その 95% 信頼区間には 0 が含まれています。両方の係数を 0 にできるかどうかをテストします。仮説行列を指定して、x1 と x2 の係数を選択します。

M = [0 1 0 0 0     
     0 0 1 0 0];   
p = coefTest(mdl,M)

p = 0.1442

p 値は約 0.14 で、小さくありません。x1 と x2 をモデルから削除します。

mdl1 = removeTerms(mdl,'x1 + x2')

mdl1 = 
Generalized linear regression model:
    logit(y) ~ 1 + x3 + x4
    Distribution = Binomial

Estimated Coefficients:
                   Estimate      SE       tStat       pValue  
                   ________    ______    _______    __________

    (Intercept)     45.272     13.612      3.326    0.00088103
    x3             -5.7545     2.3059    -2.4956      0.012576
    x4             -10.447     3.7557    -2.7816     0.0054092


100 observations, 97 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 118, p-value = 2.3e-26

あるいは、stepwiseglm を使用して重要な予測子を識別できます。

mdl2 = stepwiseglm(X,y,'constant','Distribution','binomial','Upper','linear')

1. Adding x4, Deviance = 33.4208, Chi2Stat = 105.2086, PValue = 1.099298e-24
2. Adding x3, Deviance = 20.5635, Chi2Stat = 12.8573, PValue = 0.000336166
3. Adding x2, Deviance = 13.2658, Chi2Stat = 7.29767, PValue = 0.00690441

mdl2 = 
Generalized linear regression model:
    logit(y) ~ 1 + x2 + x3 + x4
    Distribution = Binomial

Estimated Coefficients:
                   Estimate      SE       tStat      pValue 
                   ________    ______    _______    ________

    (Intercept)     50.527     23.995     2.1057    0.035227
    x2              8.3761     4.7612     1.7592    0.078536
    x3             -7.8745     3.8407    -2.0503    0.040334
    x4              -21.43     10.707    -2.0014     0.04535


100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 125, p-value = 5.4e-27

係数テーブルの x2 の p 値 (pValue) は 0.05 より大きいですが、x2 の追加に対する p 値 (PValue) が 0.05 より小さいため、stepwiseglm はモデルに x2 を含めます。関数 stepwiseglm は、x2 がある場合とない場合についての当てはめを使用して PValue を計算しますが、pValue の計算は、最終モデルからのみ計算された近似標準誤差に基づいて行われます。そのため、PValue は pValue よりも信頼性が高くなります。

外れ値の特定

てこ比のプロットを調べて、影響のある外れ値を探します。

plotDiagnostics(mdl2,'leverage')

Figure contains an axes object. The axes object with title Case order plot of leverage, xlabel Row number, ylabel Leverage contains 2 objects of type line. One or more of the lines displays its values using only markers These objects represent Leverage, Reference Line.

てこ比が p/n を大幅に超える場合、観測値は外れ値であると考えることができます。ここで、p は係数の個数、n は観測値の個数です。点線の基準線は推奨されるしきい値であり、2*p/n で計算されます。これは、このプロットの 0.08 に対応しています。一部の観測値は、10*p/n (つまり、0.40) より大きなてこ比値を示しています。これらの観測値の点を特定します。

idxOutliers = find(mdl2.Diagnostics.Leverage > 10*mdl2.NumCoefficients/mdl2.NumObservations)

idxOutliers = 4×1

    19
    21
    57
    85

これらの点を除外したモデルを当てはめたときに、モデル係数が変わるかどうかを確認します。

oldCoeffs = mdl2.Coefficients.Estimate;
mdl3 = fitglm(X,y,'linear','Distribution','binomial', ...
    'PredictorVars',2:4,'Exclude',idxOutliers);
newCoeffs = mdl3.Coefficients.Estimate;
disp([oldCoeffs newCoeffs])

   50.5268   44.0085
    8.3761    5.6361
   -7.8745   -6.1145
  -21.4296  -18.1236

mdl3 のモデル係数は、mdl2 の係数と異なっています。この結果は、てこ比が高い点での応答が、次元削減されたモデルでの予測値と一致しないことを意味しています。

versicolor である確率の予測

平均的な測定値の花が versicolor である確率を予測するには、mdl3 を使用します。この予測に対する信頼区間を生成します。

[newf,newc] = predict(mdl3,mean(X))

newf = 0.4558

newc = 1×2

    0.1234    0.8329

このモデルは、広い信頼区間をもち、平均的な花が versicolor である確率を約 46% としています。

参考