*Bounty: 50*

I want to use a statistical test to show that there is a significant difference between 9 algorithms over 2 datasets. I have read Demsar's paper and applied the Friedman test (and also the Iman-Davenport corrected version). The corrected test yields a p-value of about 0.087, which is above the 5% threshold, so the null hypothesis (all algorithms perform equally) is not rejected. If you look at the accuracies, however, it seems obvious that this might not be the case (I might be mistaken). I then ran a post-hoc procedure to adjust the p-values, using the Bergmann-Hommel or Bonferroni correction. Below are the accuracies of each algorithm on the 2 datasets and my R code results. Can you please help me interpret them and potentially find a better approach? Essentially, I am trying to compare the other algorithms against the first algorithm in the table (Clf1), which I wrote myself.

Accuracies:

Ranking of Classifiers:

```
X1 X2 X3 X4 X5 X6 X7 X8 X9
[1,] 1 8 6 4 7 2 5 3 9
[2,] 1 6 3 9 7 2 8 4 5
```

Friedman Test:

```
Friedman's rank sum test
data: data
Friedman's chi-squared = 11.733, df = 8, p-value = 0.1635
```
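To double-check the arithmetic, the same statistic can be reproduced outside of R; here is a minimal sketch in Python using `scipy.stats.friedmanchisquare` (assuming SciPy is available). Since there are no ties, passing the rank table directly is equivalent to passing the accuracies — ranking the ranks leaves them unchanged:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Ranks of the 9 classifiers (columns) on the 2 datasets (rows),
# copied from the ranking table above.
ranks = np.array([
    [1, 8, 6, 4, 7, 2, 5, 3, 9],
    [1, 6, 3, 9, 7, 2, 8, 4, 5],
])

# friedmanchisquare takes one sample per treatment (classifier);
# each sample holds that classifier's values across the datasets.
stat, p = friedmanchisquare(*ranks.T)
print(stat, p)  # chi-squared = 11.733, p ~ 0.1635
```

This matches the R output, so the test itself is computed correctly; the large p-value comes from the data, not a coding error.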

This gives me a p-value of 16%, which is not good. I assume this is due to the small number of datasets I have tested on. If I use the adjusted Friedman test instead:

Iman’s adjustment to F-test:

```
Iman Davenport's correction of Friedman's rank sum test
data: data
Corrected Friedman's chi-squared = 2.75, df1 = 8, df2 = 8, p-value = 0.08697
```
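The Iman-Davenport statistic is a simple transformation of the Friedman chi-squared, so with N = 2 datasets and k = 9 algorithms it can be reproduced by hand (a sketch with SciPy, using the F_F formula from Demsar's paper):

```python
from scipy.stats import f

N, k = 2, 9        # datasets, algorithms
chi2 = 11.7333     # Friedman chi-squared from above

# Iman-Davenport statistic, distributed as F with (k-1) and (k-1)(N-1) df
F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
p = f.sf(F_F, k - 1, (k - 1) * (N - 1))
print(F_F, p)  # F = 2.75, p ~ 0.087
```

Note that with N = 2 the denominator degrees of freedom, (k-1)(N-1) = 8, are tiny, which is why even a fairly large F statistic cannot reach significance.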

I get about 8.7%, which is better and arguably acceptable. Now I need a post-hoc method to adjust the pairwise p-values and test the hypotheses:

Nemenyi Test:

```
Nemenyi test
data: data
Critical difference = 10.834, k = 9, df = 9
```
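As a sanity check, Demsar's critical-difference formula for the Nemenyi test on mean ranks is CD = q_a * sqrt(k(k+1)/(6N)). A sketch of that calculation (the value q_0.05 = 3.102 for k = 9 is taken from the table in Demsar's paper, an external assumption):

```python
import math

N, k = 2, 9
q_alpha = 3.102   # q_0.05 for k = 9, from Demsar's table (assumption)

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(cd)  # ~ 8.50 on the mean-rank scale
```

With only 2 datasets the largest *possible* mean-rank difference is k - 1 = 8, which is already below this critical difference, and the largest observed difference here (X1 vs X2/X5/X9) is only 6. Whatever scale the package's reported critical difference of 10.834 is on, this is consistent with the all-FALSE matrix below: no pair can come out significant.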

Checking which absolute rank differences exceed the critical difference (this is the matrix of significant pairwise differences, not a confusion matrix) shows this:

```
> abs(test$diff.matrix) > test$statistic
X1 X2 X3 X4 X5 X6 X7 X8 X9
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

If I apply the Bergmann-Hommel adjustment:

```
> raw.p <- friedmanPost(data)
> adjustBergmannHommel(raw.p)
Applying Bergmann Hommel correction to the p-values computed in pairwise comparisons of 9 algorithms. This requires checking 54466 sets of hypothesis. It may take a few seconds.
X1 X2 X3 X4 X5 X6 X7 X8 X9
X1 NA 1 1 1 1 1 1 1 1
X2 1 NA 1 1 1 1 1 1 1
X3 1 1 NA 1 1 1 1 1 1
X4 1 1 1 NA 1 1 1 1 1
X5 1 1 1 1 NA 1 1 1 1
X6 1 1 1 1 1 NA 1 1 1
X7 1 1 1 1 1 1 NA 1 1
X8 1 1 1 1 1 1 1 NA 1
X9 1 1 1 1 1 1 1 1 NA
```
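This outcome is essentially forced by having only 2 datasets. Even in the best possible case — one algorithm ranked first on both datasets and another ranked last on both — the pairwise z statistic used by these post-hoc tests cannot survive a correction over the 36 pairwise comparisons. A sketch of that bound (z formula from Demsar's paper; Bonferroni is used here as a conservative stand-in for Bergmann-Hommel):

```python
import math
from scipy.stats import norm

N, k = 2, 9
n_pairs = k * (k - 1) // 2   # 36 pairwise comparisons

# Largest possible mean-rank difference with 2 datasets: ranks 1 vs 9 on both
max_diff = k - 1
z = max_diff / math.sqrt(k * (k + 1) / (6 * N))
raw_p = 2 * norm.sf(z)             # two-sided unadjusted p-value
adj_p = min(1.0, raw_p * n_pairs)  # Bonferroni-adjusted

print(z, raw_p, adj_p)  # even this best case stays above 0.05
```

So the minimum achievable adjusted p-value is roughly 0.13: no post-hoc scheme of this family can reject anything with N = 2, no matter what the accuracies look like.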

The adjustment just leaves every p-value at 1, with no significance anywhere. I would greatly appreciate an explanation of what I am doing wrong and what I can do to fix this problem.

Thank you in advance,