Comparing classification performance using the Friedman test

77 views (last 30 days)
MByk on 27 Nov 2025 at 11:34
Commented: dpb on 29 Nov 2025 at 12:38
I am trying to compare the performance of three classifiers across four performance metrics using the Friedman test in MATLAB. Since MATLAB does not include a built-in Nemenyi post-hoc test, I used the multcompare function as suggested in related discussions, and obtained the results below. If I understand correctly, a high p-value indicates that there is no significant difference between the classifier performances. How should I interpret the values in c and m? Am I doing something wrong? Also, can Pearson's r be used to compare the classifiers instead of the Friedman test and a post-hoc test? Thanks for the help.
PrfMat = [0.9352 0.9697 0.7475 0.9877;
0.9670 0.8713 0.8414 0.7052;
0.6944 0.6841 0.9851 0.9897];
[p,~,stats] = friedman(PrfMat, 1, 'on')
[c,m] = multcompare(stats, 'CType', 'tukey-kramer')
p = 0.8013
c = 1.0000 2.0000 -2.3747 0.3333 3.0413 0.9891
1.0000 3.0000 -2.0413 0.6667 3.3747 0.9216
1.0000 4.0000 -3.0413 -0.3333 2.3747 0.9891
2.0000 3.0000 -2.3747 0.3333 3.0413 0.9891
2.0000 4.0000 -3.3747 -0.6667 2.0413 0.9216
3.0000 4.0000 -3.7080 -1.0000 1.7080 0.7785
m = 2.6667 0.7454
2.3333 0.7454
2.0000 0.7454
3.0000 0.7454
  1 Comment
Umar on 29 Nov 2025 at 5:27

@MByk, you mentioned: “The classification performances of this database are quite similar, but I'd like to understand how to interpret the p, c, and m values. Would PCC be sufficient instead of running two separate tests? With a high p-value, would we conclude that there is no statistically significant difference between the three classifiers across the four metrics? I'd also like to understand what c and m represent. If I'm not mistaken, pairwise comparisons (Classifier 1 vs. C2, C1 vs. C3, and C2 vs. C3) should also be performed to show where the differences occur, but I don't see these in the results. Maybe multcompare is not performing what I want.”

My feedback: please don't give up on the Friedman test, because you're actually in an excellent position to get meaningful results. The confusion in this thread comes from everyone analyzing the small toy example you posted at the beginning, which genuinely failed because it only had three classifiers with four measurements each and extremely high variability. That tiny example was statistically underpowered and correctly showed no differences, but your real study is completely different. You mentioned you have 1500 patients with 12 features, testing seven classifiers from the Classification Learner app, and collecting four performance metrics for each classifier. This is a robust experimental design with plenty of statistical power to detect real differences if they exist.

The main technical issue you encountered was data orientation. MATLAB's friedman function expects columns to be the groups you're comparing and rows to be the repeated measurements. You had classifiers as rows and metrics as columns, which made MATLAB compare your four metrics instead of your seven classifiers. For your real analysis, you need to create a matrix that's four rows by seven columns, where each row is one of your performance metrics and each column is one of your seven classifiers. So the first row would contain the accuracy scores for all seven classifiers, the second row would be precision for all seven classifiers, the third row recall, and the fourth row F1 scores. Then when you run friedman on this properly arranged matrix, it will actually compare your classifiers across the four metrics.
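A minimal sketch of that arrangement, using made-up metric values purely for illustration (your real numbers, and your own classifier ordering, would replace these):
% Hypothetical per-metric scores, one value per classifier (7 classifiers, fixed order)
acc  = [0.91 0.88 0.90 0.87 0.92 0.89 0.90];   % accuracy  (illustrative values)
prec = [0.89 0.86 0.88 0.85 0.90 0.87 0.88];   % precision (illustrative values)
rec  = [0.90 0.85 0.89 0.84 0.91 0.86 0.89];   % recall    (illustrative values)
f1   = [0.89 0.85 0.88 0.84 0.90 0.86 0.88];   % F1 score  (illustrative values)
% Rows = repeated measurements (metrics), columns = groups (classifiers),
% which is the orientation friedman expects
PerfMatrix = [acc; prec; rec; f1];   % 4-by-7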

Regarding the output interpretation, the c matrix from multcompare shows pairwise comparisons between your classifiers. The first two columns tell you which two classifiers are being compared, the middle three columns give you the confidence interval and estimated difference in their rankings, and the last column is the p-value telling you whether that pair differs significantly. The m matrix is simpler, just showing the mean rank for each classifier and its standard error. Lower ranks mean better performance. When you run this on your actual data with seven classifiers, you should get meaningful results because you have 1500 patients worth of data, which is 375 times more than the example that failed.
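If it helps to read that output, one optional step (assuming c and m are the outputs of the multcompare call) is to wrap them in labeled tables; the column names below simply restate the documented column order of c and m:
% Label the pairwise-comparison matrix: pair indices, confidence bounds
% around the rank difference, and the p-value for that pair
cTbl = array2table(c, 'VariableNames', ...
    {'ClassifierA','ClassifierB','LowerCI','RankDiff','UpperCI','pValue'})
% Label the mean ranks and their standard errors
mTbl = array2table(m, 'VariableNames', {'MeanRank','StdErr'})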

About using Pearson correlation instead, that's not appropriate for what you're trying to do. Pearson correlation measures whether two variables move together in a linear relationship, but you want to know if your classifiers perform differently from each other. These are fundamentally different statistical questions. The Friedman test is correct for comparing multiple groups that are measured repeatedly, which is exactly your situation with seven classifiers each evaluated on four metrics.

The critical thing to understand is that even if your Friedman test comes back non-significant with a high p-value, that's a valid scientific finding, not a failure of your analysis. It would mean your seven classifiers perform equivalently on your dataset, and you should then choose among them based on practical considerations like training time, interpretability, computational cost, or ease of deployment. Many published studies report no significant differences among methods, and this is honest, valuable scientific information. You would simply report something like "The Friedman test revealed no statistically significant differences among the seven classifiers tested on 1500 patients, suggesting that all performed comparably on this dataset."

Your workflow should be straightforward. First, arrange your data correctly as that four by seven matrix. Second, visualize it with a simple boxplot to see what the distributions look like. Third, run the Friedman test with the command friedman on your matrix with one replication. Fourth, only if the overall Friedman test gives you a p-value less than 0.05 should you proceed with post-hoc pairwise comparisons using multcompare. If the p-value is greater than 0.05, you stop there and report that no significant differences were found. The mistake many researchers make is trying multiple different tests hoping to find significance somewhere, which is statistically invalid. Whatever your Friedman test tells you is the answer, whether significant or not.

The reason both dpb and I spent so much time explaining the example's failure was educational, showing you how sample size and variability affect statistical power. Those explanations demonstrated that with only four observations per classifier and high variability, you can't detect differences even if they exist. But this lesson doesn't apply to your real study. You have 1500 patients, which provides robust statistical power. The Classification Learner app likely used cross-validation, which means your performance metrics are reliable estimates. Your study design is solid and appropriate for the Friedman test.

If you want to proceed confidently, here's exactly what to do with your real data. Create your performance matrix PerfMatrix as a four-by-seven array, with row one holding all seven accuracy values, row two the seven precision values, row three the seven recall values, and row four the seven F1 scores. Make a boxplot of this matrix to visualize the distributions. Run the Friedman test with [p, tbl, stats] = friedman(PerfMatrix, 1, 'on'). Look at the p-value: if it's less than 0.05, run multcompare on the stats output to see which specific pairs differ; if it's greater than 0.05, you're done and can report that all classifiers performed similarly.
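As a sketch of that sequence, assuming PerfMatrix is the four-by-seven matrix described above:
boxplot(PerfMatrix)                                % visualize the distributions first
[p, tbl, stats] = friedman(PerfMatrix, 1, 'on');   % overall Friedman test, 1 replicate per cell
if p < 0.05
    % post-hoc pairwise comparisons only if the overall test is significant
    [c, m] = multcompare(stats, 'CType', 'tukey-kramer')
else
    disp('No significant differences among classifiers; report and stop here.')
end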

The confusion about whether MATLAB's multcompare is equivalent to the Nemenyi post-hoc test is resolved in the literature, which confirms that multcompare with Tukey-Kramer critical values after a Friedman test is mathematically equivalent to the Nemenyi test for ranked data. So you're using the correct procedure. When you write up your results for publication, you can state that classifier performance was compared using the Friedman test with post-hoc pairwise comparisons conducted using the Nemenyi test implemented via MATLAB's multcompare function with Tukey-Kramer critical values when the overall test was significant at alpha equals 0.05.
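For reference, the Nemenyi procedure is usually stated (for example in Demšar, 2006) as a critical difference in mean ranks: two classifiers differ significantly if their mean ranks differ by more than CD = q_alpha * sqrt(k*(k+1)/(6*N)), where k is the number of classifiers, N is the number of measurements per classifier (your four metrics), and q_alpha is the Studentized range critical value divided by sqrt(2). The Tukey-Kramer comparison that multcompare applies to the Friedman rank statistics uses the same Studentized range critical values, which is the basis for the equivalence described above.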

The fundamental point is this: you have excellent data for this analysis. The toy example failed for legitimate statistical reasons that don't apply to your study. You're using the correct test. Your confusion came from a simple data orientation error that's easy to fix. Whatever results you get, whether showing differences or not, will be scientifically valid and publishable. Don't abandon this approach right when you're on the verge of getting your actual results. Fix the matrix orientation, run the test on your real 1500-patient data, and trust what the statistics tell you.


Accepted Answer

Umar on 28 Nov 2025 at 2:03
Moved: Image Analyst on 28 Nov 2025 at 3:18

Hi @MByk,

You're actually doing this correctly, and contrary to common belief, MATLAB's multcompare function does perform the equivalent of the Nemenyi post-hoc test when you pass it Friedman test statistics. The Tukey-Kramer method that multcompare uses is mathematically equivalent to Nemenyi for ranked data.

Regarding your results, yes, you're correct that the high p-value of 0.8013 indicates no significant difference between your classifiers. However, there's an important issue with how you set up your data matrix. You have three classifiers and four metrics, but the way you arranged your data, MATLAB interpreted it backwards. The friedman function expects columns to represent the groups you're comparing and rows to represent the repeated measurements. Your current setup has classifiers as rows and metrics as columns, which means MATLAB compared your four metrics instead of your three classifiers. You need to transpose your data matrix like this: friedman(PrfMat.', 1, 'on').

Now let me explain what c and m represent. The c matrix shows pairwise comparisons between groups. The first two columns tell you which two groups are being compared, columns three through five give you the lower confidence limit, the estimated mean difference, and the upper confidence limit for that comparison, and the last column is the p-value for that specific pairwise test. In your case, since there's no significant difference, all the estimated differences are close to zero or small values, and the p-values are all high. The m matrix shows the mean rank and standard error for each group. The first column is the mean rank assigned to each group by the Friedman test, and the second column is the standard error of that rank.

After you transpose your data correctly, you'll see that all three classifiers have identical mean ranks of 2.0 with a Friedman p-value of 1.0, meaning there is absolutely no statistical difference detected between your three classifiers. This isn't because your test is wrong, it's because with only four measurements per classifier and the large variability in your data, there simply isn't enough statistical power to detect any differences. Looking at your actual performance values, each classifier varies wildly across the four metrics, with standard deviations around 0.11 to 0.17, which is huge compared to the small differences between classifier means. This high within-classifier variability masks any between-classifier differences.
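A quick way to see that spread, assuming PrfMat is the 3-by-4 matrix as originally posted (classifiers as rows):
% Standard deviation of each classifier's scores across the four metrics
std(PrfMat, 0, 2)   % roughly 0.11, 0.11 and 0.17 for the three classifiers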

As for using Pearson's r instead, that would not be appropriate for this type of comparison. Pearson correlation measures linear relationships between two variables, but you're trying to determine if three classifiers perform differently across multiple metrics, which is a question about group differences, not correlations. The Friedman test is the right choice here because it's a non-parametric test for comparing multiple related groups, which is exactly your situation. Using correlation would be answering a completely different question.

The real issue you're facing is a fundamental statistical power problem. With only four data points per classifier, you cannot reliably distinguish small differences in performance. To get meaningful results, you would need either more performance metrics, more datasets to test on, or metrics with less variability. Given your current data, the statistically honest conclusion is that based on these four metrics, you cannot say that any of the three classifiers performs significantly better or worse than the others.

More Answers (1)

dpb on 27 Nov 2025 at 15:53
Moved: dpb on 27 Nov 2025 at 15:53
PrfMat=[0.9352 0.9697 0.7475 0.9877;
0.9670 0.8713 0.8414 0.7052;
0.6944 0.6841 0.9851 0.9897];
[mean(PrfMat); std(PrfMat); std(PrfMat)./mean(PrfMat)]
ans = 3×4
    0.8655    0.8417    0.8580    0.8942
    0.1491    0.1451    0.1197    0.1637
    0.1722    0.1724    0.1395    0.1830
boxplot(PrfMat); hAx=gca; hAx.YAxis.TickLabelFormat='%0.2f'; ylim([0.6 1.1])
Note the means are all quite similar, as shown by the boxplot; the ranges overlap almost identically. With only three observations and one replication, the power to distinguish any small discrepancies that might be present is very low. I think it would be impossible to draw any other conclusion from these data with any test statistic.
  3 Comments
dpb on 27 Nov 2025 at 20:30
I presumed your data were for four treatments based on the orientation of the array -- in keeping with MATLAB's general convention, friedman() treats the columns as the effects and the rows as observations, with the replicates for each grouped sequentially if there are any.
Consequently, we have to transpose the data:
PrfMat=[0.9352 0.9697 0.7475 0.9877;
0.9670 0.8713 0.8414 0.7052;
0.6944 0.6841 0.9851 0.9897].'
PrfMat = 4×3
    0.9352    0.9670    0.6944
    0.9697    0.8713    0.6841
    0.7475    0.8414    0.9851
    0.9877    0.7052    0.9897
[mean(PrfMat); std(PrfMat); std(PrfMat)./mean(PrfMat)]
ans = 3×3
    0.9100    0.8462    0.8383
    0.1105    0.1082    0.1722
    0.1214    0.1279    0.2054
boxplot(PrfMat); hAx=gca; hAx.YAxis.TickLabelFormat='%0.2f'; ylim([0.6 1.1])
This still shows very little difference between treatments compared to the in-treatment variability, so it's highly unlikely there will be any statistically significant differences, but we can see what it thinks...
[p,~,stats] = friedman(PrfMat, 1,'off')
p = 1
stats = struct with fields:
       source: 'friedman'
            n: 4
    meanranks: [2 2 2]
        sigma: 1
[c,m] = multcompare(stats, 'Display', 'off')
c = 3×6
    1.0000    2.0000   -1.6572         0    1.6572    1.0000
    1.0000    3.0000   -1.6572         0    1.6572    1.0000
    2.0000    3.0000   -1.6572         0    1.6572    1.0000
m = 3×2
    2.0000    0.5000
    2.0000    0.5000
    2.0000    0.5000
As for the c array, it is described in the output-arguments section of the documentation, but briefly: the first two columns identify the pair being compared, followed by the lower 95% limit, the estimate, and the upper 95% limit for the difference. The differences are in the effects' rankings, and since there is no significant difference, the estimated mean rank for each effect is 2 (the median), so the differences are all identically 0. The last column is the p-value, which here is identically 1.
The m values are the estimated mean ranks and their standard errors; as above, since there is no statistical difference, they're all 2.
MByk on 29 Nov 2025 at 12:14
Thank you both for the detailed explanations. I wish I understood the topic as well as you do. I'm not a statistician, but I'm trying my best. We're planning to write a paper, and I thought it would be helpful to include a statistical analysis of classification performance rather than simply presenting the results in a table. That's why I asked this question, but from what I've seen, the work isn't just about running tests and sharing the results.
dpb on 29 Nov 2025 at 12:38
NOTA BENE: My comment is not to say the Friedman test is the incorrect one, only that
  1. I am always reluctant to make a definitive recommendation without very detailed knowledge of the application (burnt too many times in a former life), and
  2. it's possible there are alternatives with more power than the nonparametric test, if their assumptions can be met.
@Umar is correct that a negative result is also a valid conclusion, even if it may be disappointing to the researcher that a specific idea doesn't pan out as hoped; weeding out what doesn't work is as much a part of advancing science as finding what does.
If you are proposing to write a paper, my last suggestion would be to find a university consulting statistician with whom you can discuss this; if it will be submitted to a refereed journal, any bad decisions now will almost certainly be questioned.
