# #StackBounty: #hypothesis-testing #statistical-significance #p-value #friedman-test Statistical Test Evaluation?

### Bounty: 50

I want to use a statistical test to show that there is a significant difference between 9 algorithms evaluated on 2 datasets. I have read Demsar's paper and used Friedman's test (and also Iman-Davenport's corrected version of it). The test finds a p-value of 0.08, which is above the 5% threshold, so the null hypothesis (that all algorithms perform equally) is not rejected. Looking at the accuracies, that seems unlikely to be the case (though I might be mistaken). I then ran the results through a post-hoc procedure to adjust the p-values, using the Bergmann-Hommel or Bonferroni correction. Below are the accuracies of each algorithm on the 2 datasets and my R output. Can you please help me understand them and potentially suggest a better approach? Essentially, I am trying to compare the other algorithms against the first algorithm in the table, which I wrote myself (Clf1).

Accuracies:

Ranking of Classifiers:

```
     X1 X2 X3 X4 X5 X6 X7 X8 X9
[1,]  1  8  6  4  7  2  5  3  9
[2,]  1  6  3  9  7  2  8  4  5
```

Friedman Test:

```
        Friedman's rank sum test

data:  data
Friedman's chi-squared = 11.733, df = 8, p-value = 0.1635
```
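For what it's worth, this statistic can be reproduced directly from the rank matrix above; here is a sketch in Python with `scipy` (the `ranks` array is copied from the question):

```python
import numpy as np
from scipy.stats import friedmanchisquare

ranks = np.array([
    [1, 8, 6, 4, 7, 2, 5, 3, 9],   # dataset 1
    [1, 6, 3, 9, 7, 2, 8, 4, 5],   # dataset 2
])

# friedmanchisquare takes one argument per algorithm (column),
# each holding that algorithm's measurements across the datasets.
stat, p = friedmanchisquare(*ranks.T)
print(stat, p)  # ~11.733 and ~0.1635, matching the R output
```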

This gives me a p-value of about 16%, which is not significant. I assume this is due to the small number of datasets (only 2) I have tested on. If I use the adjusted Friedman test:

```
     Iman Davenport's correction of Friedman's rank sum test

data:  data
Corrected Friedman's chi-squared = 2.75, df1 = 8, df2 = 8, p-value = 0.08697
```
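This corrected value follows from the Friedman chi-squared via the standard Iman-Davenport formula, F_F = (N-1)·chi2 / (N·(k-1) - chi2), which is F-distributed with k-1 and (k-1)(N-1) degrees of freedom. A sketch, plugging in the chi-squared of 11.733 reported above:

```python
from scipy.stats import f

N, k = 2, 9                      # datasets, algorithms
chi2 = 11.733                    # Friedman chi-squared from the output above
ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
df1, df2 = k - 1, (k - 1) * (N - 1)
p = f.sf(ff, df1, df2)
print(ff, p)  # ~2.75 and ~0.087, matching the R output
```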

I get roughly 8.7%, which is better and arguably acceptable. Now I must apply a post-hoc method to compute the pairwise p-values and test the hypotheses:

Nemenyi Test:

```
     Nemenyi test

data:  data
Critical difference = 10.834, k = 9, df = 9
```
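For comparison, Demsar's paper gives the Nemenyi critical difference as CD = q_alpha · sqrt(k·(k+1) / (6·N)), with q_alpha taken from his table (3.102 for k = 9 at alpha = 0.05). A sketch of that formula; note it yields a smaller value than the 10.834 that scmamp prints, so scmamp is presumably using a different scale or finite degrees of freedom for its statistic:

```python
import math

N, k = 2, 9
q_005 = 3.102                    # Demsar (2006), table for k = 9, alpha = 0.05
cd = q_005 * math.sqrt(k * (k + 1) / (6 * N))
print(cd)  # ~8.50; with N = 2 no pair of mean ranks can differ by this much
```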

Comparing the matrix of pairwise rank differences against the critical difference shows this:

```
> abs(test$diff.matrix) > test$statistic
X1    X2    X3    X4    X5    X6    X7    X8    X9
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```
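A sketch of why every entry is FALSE: with only N = 2 datasets the mean ranks lie in [1, 9], so no pairwise difference can exceed 8, let alone the critical difference of 10.834 (this assumes `diff.matrix` holds mean-rank differences, which is my reading of the scmamp output):

```python
import numpy as np

ranks = np.array([
    [1, 8, 6, 4, 7, 2, 5, 3, 9],
    [1, 6, 3, 9, 7, 2, 8, 4, 5],
])
mean_ranks = ranks.mean(axis=0)
diffs = np.abs(mean_ranks[:, None] - mean_ranks[None, :])
print(diffs.max())             # 6.0 -- the largest gap (X1 vs the worst ranks)
print((diffs > 10.834).any())  # False everywhere, matching the matrix above
```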

If I do a Bergmann adjustment:

```
> raw.p <- friedmanPost(data)
Applying Bergmann Hommel correction to the p-values computed in pairwise comparisons of 9 algorithms. This requires checking 54466 sets of hypothesis. It may take a few seconds.
X1 X2 X3 X4 X5 X6 X7 X8 X9
X1 NA  1  1  1  1  1  1  1  1
X2  1 NA  1  1  1  1  1  1  1
X3  1  1 NA  1  1  1  1  1  1
X4  1  1  1 NA  1  1  1  1  1
X5  1  1  1  1 NA  1  1  1  1
X6  1  1  1  1  1 NA  1  1  1
X7  1  1  1  1  1  1 NA  1  1
X8  1  1  1  1  1  1  1 NA  1
X9  1  1  1  1  1  1  1  1 NA
```
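A sketch of why the adjusted p-values all come out as 1: with N = 2 the standard error of a mean-rank difference is sqrt(k·(k+1)/(6·N)) ≈ 2.74, so even the largest observed difference (6.0) gives a raw two-sided p of only about 0.03, and any multiplicity correction over the 36 pairwise comparisons pushes that to 1 (Bonferroni is used here as the simplest illustration; Bergmann-Hommel is less conservative, but the output above shows it also ends at 1):

```python
import math
from scipy.stats import norm

N, k = 2, 9
se = math.sqrt(k * (k + 1) / (6 * N))          # std. error of a mean-rank diff
z = 6.0 / se                                   # largest observed difference
raw_p = 2 * norm.sf(z)                         # raw two-sided p-value
adj_p = min(1.0, raw_p * (k * (k - 1) // 2))   # Bonferroni over 36 pairs
print(raw_p, adj_p)  # ~0.028 and 1.0
```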

The adjustment just results in p-values of 1, which show no significance anywhere. I would greatly appreciate an explanation of what I am doing wrong and what I can do to fix this problem.