Why is SVM performance with small random datasets so high?
To better understand how SVMs work, I am training a binary SVM with the function fitcsvm on a sample dataset of completely random numbers, and evaluating the classifier with 10-fold cross-validation.
Since the dataset consists of random numbers, I would expect the classification accuracy of the trained cross-validated SVM to be around 50%.
However, with small datasets, for example 2 predictors and 12 observations (6 per class), I get very high classification accuracy, up to about 75%. The accuracy gets close to 50% as the dataset grows, for example with 2 predictors and 60 observations, or 40 predictors and 12 observations. Why is the classification accuracy so high with small datasets?
I guess that small datasets make over-fitting more likely. Is this the case here?
Still, with cross-validation, the SVM is repeatedly trained on nine partitions and tested on the tenth. Even if the dataset is small, I would expect an accuracy of around 50%, simply because the tenth partition is made of random numbers. Does the cross-validation perform some optimization of the model parameters?
The code that I am using is something like the following, where I try 100 different combinations of Kernel Scale and Box Constraint and then take the combination that yields the lowest classification error:
SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true,...
MisclassRate = kfoldLoss(SVMModel);
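For readers who want to reproduce the setup, here is a minimal sketch of the search described above. The random data, the class coding, and the 10-by-10 grid of Box Constraint and Kernel Scale values are assumptions made for illustration; the original grid is not shown in the post. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: data, labels and grid values are assumed, not the poster's.
rng(0);                                   % reproducible random numbers
cdata = randn(12, 2);                     % 12 observations, 2 random predictors
label = [ones(6, 1); 2 * ones(6, 1)];     % 6 observations per class

boxVals   = logspace(-2, 2, 10);          % assumed Box Constraint values
scaleVals = logspace(-2, 2, 10);          % assumed Kernel Scale values
MisclassRate = zeros(numel(boxVals), numel(scaleVals));

for i = 1:numel(boxVals)
    for j = 1:numel(scaleVals)
        % 'KFold',10 makes fitcsvm return a 10-fold cross-validated model
        CVModel = fitcsvm(cdata, label, 'KernelFunction', 'linear', ...
            'Standardize', true, 'BoxConstraint', boxVals(i), ...
            'KernelScale', scaleVals(j), 'KFold', 10);
        MisclassRate(i, j) = kfoldLoss(CVModel);
    end
end

bestAccuracy = 1 - min(MisclassRate(:));  % accuracy of the selected model
meanAccuracy = 1 - mean(MisclassRate(:)); % average over all 100 models
fprintf('best: %.2f, mean: %.2f\n', bestAccuracy, meanAccuracy);
```

Comparing bestAccuracy with meanAccuracy shows how much of the reported accuracy comes from picking the best of 100 combinations rather than from any single model.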
I would very much appreciate any clarification. Many thanks!
Ilya on 27 Feb 2017
Let me make sure I got your procedure right. You apply M models to a dataset and measure their accuracies by cross-validation. Each model is described by a set of parameter values such as box constraint and kernel scale. Out of these models, you select the one with largest cross-validation accuracy a_best and record the parameter values for this model pars_best. To estimate the significance of this model, you learn the same model (that is, pass pars_best to fitcsvm) on R synthetic datasets. Each synthetic dataset is obtained by randomly permuting class labels in the original dataset. You estimate cdf F(a) over these R accuracy values. Then you take 1-F(a_best) to be the p-value for the null hypothesis "the model pars_best has no discriminative power".
If I got this right, you should modify your procedure like so. In every run (that is, for every noise dataset), instead of recording accuracy of a model learned using pars_best, search for the best model over M parameter values and record the accuracy for that best model. Estimate cdf F_noisebest(a) using these R values and take 1-F_noisebest(a_best) to be the p-value.
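The modified procedure could be sketched roughly as follows. The data, the 3-by-3 parameter grid (M = 9), and R = 50 permutations are placeholders chosen to keep the run short, and the names cvAcc, bestAcc and aNoiseBest are hypothetical. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: X, y, the grid and R are placeholders, not the poster's values.
rng(1);
X = randn(12, 2);                         % replace with the real predictors
y = [ones(6, 1); -ones(6, 1)];            % replace with the real class labels

% Assumed parameter grid: M = 9 (Box Constraint, Kernel Scale) pairs.
[C, S] = ndgrid(logspace(-2, 2, 3), logspace(-1, 1, 3));
pars = [C(:), S(:)];

% 10-fold cross-validated accuracy of one model on the given labels.
cvAcc = @(labels, bc, ks) 1 - kfoldLoss(fitcsvm(X, labels, ...
    'KernelFunction', 'linear', 'Standardize', true, ...
    'BoxConstraint', bc, 'KernelScale', ks, 'KFold', 10));

% Best accuracy over all M models: the search is repeated for every dataset.
bestAcc = @(labels) max(arrayfun( ...
    @(m) cvAcc(labels, pars(m, 1), pars(m, 2)), 1:size(pars, 1)));

aBest = bestAcc(y);                       % a_best on the original labels

R = 50;                                   % number of noise datasets
aNoiseBest = zeros(R, 1);
for r = 1:R
    yPerm = y(randperm(numel(y)));        % permute the class labels
    aNoiseBest(r) = bestAcc(yPerm);       % best of M on the noise dataset
end

% Empirical p-value. Using >= (rather than >) is a conservative choice
% made here because accuracies are discrete and ties are common.
pValue = mean(aNoiseBest >= aBest);
fprintf('a_best = %.2f, p-value = %.2f\n', aBest, pValue);
```

The key difference from the original procedure is that bestAcc, not a single model with pars_best, is evaluated on every permuted dataset.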
In your procedure, you apply a classifier to a noise dataset and its accuracy is expected to be that of a random coin toss (perhaps an unfair toss, if you have imbalanced classes). In my procedure, you choose the best out of M classifiers applied to a noise dataset, and the best accuracy is usually better (or a lot better) than a random coin toss. This could increase your estimate of the p-value quite a bit, making the best model pars_best less significant.
You could also use simple analytic formulas for the binomial distribution and order statistic to verify your computation.
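One way to set up such a check, as a sketch: if each of M models is treated as an independent coin-flip classifier on n observations, the number correct for one model is Binomial(n, 0.5) with cdf F(k), and the cdf of the best of M is F(k)^M. The independence assumption is an idealization, since models fitted to the same data are correlated, so the result is an upper reference rather than an exact prediction. The values n = 12 and M = 100 come from the question; the variable names are hypothetical.

```matlab
% Sketch under an independence assumption; uses base MATLAB only.
n = 12;                                   % observations
M = 100;                                  % models compared
k = 0:n;                                  % possible numbers of correct labels

pmf  = arrayfun(@(j) nchoosek(n, j), k) / 2^n;  % Binomial(n, 0.5) pmf
F    = cumsum(pmf);                       % cdf of one model's score
Fmax = F .^ M;                            % cdf of the best of M scores
pmfMax = diff([0, Fmax]);                 % P(best score == k)

expectedBestAcc = sum(k .* pmfMax) / n;   % expected best accuracy

% P(best score >= kBest) = 1 - Fmax evaluated at kBest - 1.
% Index kBest corresponds to k = kBest - 1 because k starts at 0.
pBestAtLeast = @(kBest) 1 - Fmax(kBest);

fprintf('expected best accuracy: %.3f\n', expectedBestAcc);
fprintf('P(best >= 9 of 12): %.4f\n', pBestAtLeast(9));
```

Comparing these numbers with the simulated F_noisebest gives a rough sanity check on the permutation results.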
More Answers (1)
Ilya on 31 Jan 2017
You have 12 observations. For each observation, the probability of correct classification is 0.5. What is the probability of classifying 9 or more observations correctly by chance? It's
>> p = binocdf(8,12,0.5,'upper')
And what is the probability of that chance event occurring at least once in 100 experiments? Treating the experiments as independent, it's 1-(1-p)^100. With p of about 0.073, that is above 0.999.
Since you take the most accurate model, you always get a highly optimistic estimate of accuracy, that's all.
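The two probabilities in this answer can be checked with a short script. This is a sketch that uses base MATLAB functions in place of binocdf, and it treats the 100 parameter combinations as independent coin-flip classifiers, which is an idealization. The simulation size nTrials is an arbitrary choice.

```matlab
% Sketch only: models are idealized as independent fair coin flips.
nObs = 12;                                % observations
kMin = 9;                                 % 9 of 12 correct = 75% accuracy
nModels = 100;                            % parameter combinations tried

% P(at least 9 of 12 correct by chance), same as binocdf(8,12,0.5,'upper')
p = sum(arrayfun(@(j) nchoosek(nObs, j), kMin:nObs)) / 2^nObs;

% P(that event occurs at least once among 100 independent experiments)
pAtLeastOnce = 1 - (1 - p)^nModels;

% Monte Carlo check: best score among 100 coin-flip classifiers per trial
rng(0);
nTrials = 2000;
flips = rand(nObs, nModels, nTrials) < 0.5;   % true = correct label
nCorrect = sum(flips, 1);                     % 1 x nModels x nTrials
bestCorrect = max(nCorrect, [], 2);           % best model in each trial
fracHit = mean(bestCorrect(:) >= kMin);       % share of trials reaching 75%

fprintf('p = %.4f, analytic = %.4f, simulated = %.4f\n', ...
    p, pAtLeastOnce, fracHit);
```

In practice the 100 models are fitted to the same data and are correlated, so the real chance of reaching 75% is lower than this idealized figure, but the selection effect is the same.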