@MByk, you mentioned, “ The classification performances of this database are quite similar, but I'd like to understand how to interpret the p, c, and m values and would PCC be sufficient instead of running two separate tests? With a high p-value, would we conclude that there is no statistically significant difference between the three classifiers across the four metrics? I'd also like to understand what c and m represent. If I'm not mistaken, pairwise comparisons (Classifier 1 vs. C2, C1 vs. C3, and C2 vs. C3) should also be performed to show where the differences occur, but I don't see these in the results. May be "Multcompere" is not performing what i want. ”
My feedback: Please don't give up on the Friedman test, because you're actually in an excellent position to get meaningful results. The confusion in this thread comes from everyone analyzing the small toy example you posted at the beginning, which genuinely failed because it had only three classifiers with four measurements each and extremely high variability. That tiny example was statistically underpowered and correctly showed no differences, but your real study is completely different. You mentioned you have 1500 patients with 12 features, you are testing seven classifiers from the Classification Learner app, and you are collecting four performance metrics for each classifier. That is a robust experimental design with plenty of statistical power to detect real differences if they exist.
The main technical issue you encountered was data orientation. MATLAB's friedman function expects columns to be the groups you're comparing and rows to be the repeated measurements. You had classifiers as rows and metrics as columns, which made MATLAB compare the metrics to each other instead of comparing the classifiers. For your real analysis, you need a matrix that is four rows by seven columns, where each row is one of your performance metrics and each column is one of your seven classifiers: the first row holds the accuracy scores for all seven classifiers, the second row precision, the third row recall, and the fourth row the F1 scores. When you run friedman on this properly arranged matrix, it will actually compare your classifiers across the four metrics.
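For concreteness, here is a minimal sketch of that arrangement. All the numbers and the classifier names below are placeholders I made up, not your results; substitute the values you recorded from the Classification Learner app.

% Rows: Accuracy, Precision, Recall, F1 -- Columns: the seven classifiers
% (every value below is a hypothetical placeholder)
PerfMatrix = [ ...
    0.91 0.89 0.93 0.90 0.88 0.92 0.90;   % Accuracy
    0.88 0.86 0.91 0.87 0.85 0.90 0.88;   % Precision
    0.90 0.87 0.92 0.89 0.86 0.91 0.89;   % Recall
    0.89 0.86 0.91 0.88 0.85 0.90 0.88];  % F1 score

% Optional labels for plots and for reading the post-hoc output later (hypothetical names)
classifierNames = {'Tree','LDA','LogReg','SVM','KNN','Ensemble','NaiveBayes'};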
Regarding the output interpretation, the c matrix from multcompare lists the pairwise comparisons between your classifiers. The first two columns identify which two classifiers are being compared, the next three columns give the lower confidence bound, the estimated difference in mean ranks, and the upper confidence bound, and the last column is the p-value telling you whether that pair differs significantly. The m matrix is simpler: one row per classifier, with its mean rank and the standard error of that rank. Lower ranks mean better performance. When you run this on your actual data with seven classifiers, you should get meaningful results, because you have 1500 patients' worth of data, 375 times more than the example that failed.
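Assuming stats is the third output of friedman (as in the workflow further down), a small sketch of pulling out and reading those two matrices might look like this; the variable names are just placeholders:

% Pairwise comparisons after the Friedman test
[c, m] = multcompare(stats, 'Display', 'off');

% Each row of c: [classifierA  classifierB  lowerCI  rankDifference  upperCI  pValue]
% Each row of m: [meanRank  standardError], one row per column of PerfMatrix
sigPairs = c(c(:, 6) < 0.05, 1:2)   % indices of classifier pairs that differ significantly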
About using Pearson correlation (the PCC you mentioned) instead: that's not appropriate for what you're trying to do. Pearson correlation measures whether two variables move together in a linear relationship, but you want to know whether your classifiers perform differently from each other. These are fundamentally different statistical questions. The Friedman test is the right tool for comparing multiple groups that are measured repeatedly, which is exactly your situation with seven classifiers each evaluated on four metrics.
The critical thing to understand is that even if your Friedman test comes back non-significant with a high p-value, that's a valid scientific finding, not a failure of your analysis. It would mean your seven classifiers perform equivalently on your dataset, and you should then choose among them based on practical considerations like training time, interpretability, computational cost, or ease of deployment. Many published studies report no significant differences among methods, and this is honest, valuable scientific information. You would simply report something like "The Friedman test revealed no statistically significant differences among the seven classifiers tested on 1500 patients, suggesting that all performed comparably on this dataset."
Your workflow should be straightforward. First, arrange your data correctly as that four by seven matrix. Second, visualize it with a simple boxplot to see what the distributions look like. Third, run friedman on your matrix with the replication argument set to 1, since each cell of the matrix holds a single value. Fourth, only if the overall Friedman test gives you a p-value less than 0.05 should you proceed with post-hoc pairwise comparisons using multcompare. If the p-value is greater than 0.05, you stop there and report that no significant differences were found. The mistake many researchers make is trying multiple different tests hoping to find significance somewhere, which is statistically invalid. Whatever your Friedman test tells you is the answer, whether significant or not.
The reason both dpb and I spent so much time explaining the example's failure was educational: it showed how sample size and variability affect statistical power. Those explanations demonstrated that with only four observations per group and high variability, you can't detect differences even if they exist. But this lesson doesn't apply to your real study. You have 1500 patients, which provides robust statistical power. The Classification Learner app likely used cross-validation, which means your performance metrics are reliable estimates. Your study design is solid and appropriate for the Friedman test.
If you want to proceed confidently, here's exactly what to do with your real data. Create your performance matrix PerfMatrix as a four by seven array, with row one holding the seven accuracy values, row two the seven precision values, row three the seven recall values, and row four the seven F1 scores. Make a boxplot of this matrix to visualize the distributions. Run the Friedman test with [p, tbl, stats] = friedman(PerfMatrix, 1, 'on') and look at the p-value. If it's less than 0.05, run multcompare on the stats output to see which specific pairs differ. If it's greater than 0.05, you're done and can report that all classifiers performed similarly. The sketch below puts these steps together.
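A minimal end-to-end sketch, assuming PerfMatrix and classifierNames are defined as in the placeholder example above:

% Steps 1 and 2: the matrix is already built; visualize the metric values per classifier
boxplot(PerfMatrix, 'Labels', classifierNames)
ylabel('Metric value')

% Step 3: Friedman test -- columns are the groups (classifiers), rows are the repeated measures (metrics)
[p, tbl, stats] = friedman(PerfMatrix, 1, 'off');
fprintf('Friedman test p-value: %.4f\n', p)

% Step 4: post-hoc pairwise comparisons only if the overall test is significant
if p < 0.05
    [c, m] = multcompare(stats);   % default Tukey-Kramer critical values
else
    disp('No significant differences among classifiers; stop here and report that.')
end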
The confusion about whether MATLAB's multcompare is equivalent to the Nemenyi post-hoc test is resolved in the literature, which confirms that multcompare with Tukey-Kramer critical values after a Friedman test is mathematically equivalent to the Nemenyi test for ranked data. So you're using the correct procedure. When you write up your results for publication, you can state something like "Classifier performance was compared using the Friedman test; when the overall test was significant at alpha = 0.05, post-hoc pairwise comparisons were conducted using the Nemenyi test, implemented via MATLAB's multcompare function with Tukey-Kramer critical values."
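If you want the critical-value choice to be explicit in your script rather than relying on the default, a hedged sketch of that call is below; the fourth output just gives you the group labels so you can map the indices in c back to your classifiers:

% Nemenyi-style post-hoc comparisons: Tukey-Kramer critical values on the Friedman ranks
[c, m, ~, gnames] = multcompare(stats, 'CType', 'tukey-kramer', 'Alpha', 0.05);
% gnames lists the column (classifier) labels that the indices in columns 1-2 of c refer to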
The fundamental point is this: you have excellent data for this analysis. The toy example failed for legitimate statistical reasons that don't apply to your study. You're using the correct test. Your confusion came from a simple data orientation error that's easy to fix. Whatever results you get, whether showing differences or not, will be scientifically valid and publishable. Don't abandon this approach right when you're on the verge of getting your actual results. Fix the matrix orientation, run the test on your real 1500-patient data, and trust what the statistics tell you.




