real or categorical predictors, which one is faster?

mono on 17 Sep 2023
Edited: dpb on 19 Sep 2023
In regression, is there a guideline for treating predictors as real-valued or categorical?
In a fitting problem with inputs X and y, where X contains the hour of the day (e.g. 1, 2, 3, ...), I tend to treat it as a categorical predictor because the number of unique values in X is limited (i.e. 24). Surprisingly, fitting a Gaussian process model with fitrgp seems slower this way than when X is treated as real-valued.
My questions are:
  1. Why does it take longer with a categorical predictor?
  2. In a similar situation, is there a guideline for deciding whether to treat the predictors as real values or as categorical inputs?
  3 Comments
mono on 17 Sep 2023
I don't think fitrgp supports uint8 input.
dpb on 19 Sep 2023
Edited: dpb on 19 Sep 2023
"why does it take longer with categorical predictor?"
I'd venture owing to the large number of dummy variables introduced by having 24 levels of time being modeled as categorical instead of continuous/discrete. You could try artificially reducing the same data set to 24, 12, 2 levels and see if that hypothesis is correct.
Regardless of whether it's true or not, it's still the model definition and purpose that should be controlling decisions such as this, not anything to do with compute time.
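The dummy-variable hypothesis in the comment above can be made concrete with a small sketch. It is written in plain Python rather than MATLAB, purely as a language-neutral illustration; `dummy_encode` is a hypothetical helper, the hourly data are made up, and whether fitrgp keeps every level or drops a reference level is not stated in this thread. It only shows how the number of predictor columns grows with the number of levels, not how long any fit takes.

```python
def dummy_encode(values, drop_first=False):
    """One-hot encode a list of category labels into rows of 0/1 columns."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # treat the first level as the reference category
    return [[1 if v == level else 0 for level in levels] for v in values]


# Hypothetical data: 10 days of hourly readings, hour coded 0..23
hours = [h % 24 for h in range(240)]

for n_levels in (24, 12, 2):
    # Collapse the 24 hours into fewer, equally wide bins
    binned = [h * n_levels // 24 for h in hours]
    encoded = dummy_encode(binned)
    print(n_levels, "levels ->", len(encoded[0]), "predictor columns")
```

A single numeric hour column stays one column wide, whereas the categorical version becomes 24, 12 or 2 columns here. Timing the actual fit at each of these settings would be the way to check the hypothesis.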


Accepted Answer

dpb on 17 Sep 2023
Edited: dpb on 17 Sep 2023
Whether to use a categorical or a continuous variable is context sensitive and should be based on the model intent and interpretation, NOT the compute effort required.
If the time was measured in discrete intervals such as hours, then the question revolves around the interpretation of the response variable -- if the time is considered as continuous, then the regression model will calculate a coefficient between that predictor and the response variable after accounting for other variables in the model. Since the discrete predictor is numerical, fitting a line to it can be reasonable; its values are true numbers with meaningful intervals between them.
On the other hand, if you specify a predictor as categorical, the regression estimates a mean of the response variable for each category of the predictor, and each coefficient in the set for that predictor measures the difference in the mean of Y between one category of X and the reference category.
So, the decision should be based on the research question being asked -- is it about estimating a change over X (time) or is it about estimating mean differences at specific values of X (hours)?
Which solution is more expensive in compute time is irrelevant...
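The two interpretations described in the answer above can be compared on a toy dataset. This is a sketch in plain Python rather than MATLAB, with made-up numbers and ordinary least squares standing in for the regression; it is not fitrgp and not anyone's actual data. It computes one slope when the hour is treated as a number, and one mean per hour (plus differences from a reference hour) when it is treated as categorical.

```python
from statistics import mean

# Hypothetical data: response y observed at hour x (made-up numbers)
x = [1, 1, 2, 2, 3, 3]
y = [2.0, 4.0, 5.0, 7.0, 3.0, 5.0]

# Continuous interpretation: a single slope describes the change in y per hour
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Categorical interpretation: one mean of y per level of x
levels = sorted(set(x))
level_means = {lv: mean(yi for xi, yi in zip(x, y) if xi == lv) for lv in levels}

# Each coefficient is the difference in mean from the reference level
reference = levels[0]
contrasts = {lv: level_means[lv] - level_means[reference] for lv in levels[1:]}

print("slope per hour:", slope, "intercept:", intercept)
print("mean of y per hour:", level_means)
print("difference from hour", reference, ":", contrasts)
```

In this made-up example the response rises and then falls across the three hours; the straight line summarises that as a single average slope, while the per-level means keep each hour separate. Which summary is appropriate depends on the research question, as the answer says.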
  4 Comments
mono on 19 Sep 2023
Edited: mono on 19 Sep 2023
Categorical data is still numerical after dummy encoding. But as you point out, the meaning of the intervals between the values (if there is any) is lost.
Instinctively, and loosely speaking, for predictors like [apple, pear, orange] it makes more sense to treat them as categorical. For cases like hours or time as input X, where the gaps between values also play a role, perhaps it is reasonable to treat them as numerical without dummy encoding.
That seems to be the conceptual difference between categorical and discrete?
If so, that leads to another interesting question. Taking the fitrgp function as an example, is there anything I can do to speed up the optimization for discrete input? As @Walter Roberson also pointed out, the memory used for a uint value, e.g. 3, is much less than for a double, e.g. 3.0. I guess that without a data type declaration, fitrgp would treat all numerical input as double, regardless of whether it is discrete or continuous.
dpb on 19 Sep 2023
The solution time is what it is; it depends upon the model and the size of the input dataset. MATLAB treats everything as double numerically and computationally, so the data class passed to fitrgp is going to be immaterial. The only way the input can be anything but a single numeric class (double) when calling it is to use a table whose variables are defined differently. But even if you do so, they're all going to be converted to double() internally anyway.
The consideration of whether a variable is categorical is material only in the interpretation and model-building as to whether it creates the dummy variables or not.
There are some notes about using the QR or "V" solution technique, the latter of which is apparently faster but not exact, but it appears that it will use the latter automagically if the size of the input is >2000 anyway. I'd expect that's the only choice you have for affecting the solution speed; beyond that, it comes down to the size of the dataset and the uncontrollable effect of the specific data.
