*Bounty: 50*

Which test tells me whether, e.g., an F1 score of 0.69 for classifier A and 0.72 for classifier B are truly different and not just different by chance? (For mean values one would use a t-test and obtain a p-value.) I have access to the underlying data, not only to the F1 scores.
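To make the setting concrete, here is a minimal sketch of one candidate approach, a paired bootstrap over test cases. It is only an illustration under assumptions, not an established standard: it assumes binary 0/1 labels and that both classifiers were scored on the same test cases. The names `f1_binary` and `paired_bootstrap_f1_diff` are hypothetical helpers, not functions from any package.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for binary 0/1 labels (positive class = 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def paired_bootstrap_f1_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Return (observed F1_B - F1_A, lower, upper) with a 95% percentile interval.

    Test cases are resampled with replacement, and the SAME resampled indices
    are used for both classifiers, so the pairing is preserved.
    """
    y_true, pred_a, pred_b = (np.asarray(x) for x in (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    observed = f1_binary(y_true, pred_b) - f1_binary(y_true, pred_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one bootstrap resample of test cases
        diffs[i] = f1_binary(y_true[idx], pred_b[idx]) - f1_binary(y_true[idx], pred_a[idx])
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return observed, lower, upper
```

If the resulting interval excludes 0, that would suggest the difference is not just chance at roughly the 5% level, but I do not know whether this counts as the accepted method.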

And how can one estimate the test-set sample size needed in order not to miss a true difference between the F1 scores, as in the example above? (For mean values one would use a power analysis.) In other words, if I want to know which classifier (A or B) is truly better at a given significance level, how many test cases do I need?
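As a rough illustration of what I mean by a power analysis here, the sketch below resamples a pilot test set at a candidate size `n`, estimates the standard error of the F1 difference at that size, and plugs it into a normal approximation. Everything in it is an assumption on my part: binary 0/1 labels, a pilot set whose observed difference is taken as the true effect, the normal approximation itself, and the hypothetical helper names `f1_binary` and `approx_power`.

```python
import math
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for binary 0/1 labels (positive class = 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def approx_power(y_true, pred_a, pred_b, n, n_sim=1000, z_crit=1.96, seed=0):
    """Approximate power to detect the pilot F1 difference with n test cases.

    Assumes the pilot difference is the true effect and that the F1 difference
    is roughly normal; z_crit=1.96 corresponds to a two-sided 5% level.
    """
    y_true, pred_a, pred_b = (np.asarray(x) for x in (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    m = len(y_true)
    delta = f1_binary(y_true, pred_b) - f1_binary(y_true, pred_a)
    diffs = np.empty(n_sim)
    for i in range(n_sim):
        idx = rng.integers(0, m, size=n)  # simulated test set of size n
        diffs[i] = f1_binary(y_true[idx], pred_b[idx]) - f1_binary(y_true[idx], pred_a[idx])
    se = diffs.std(ddof=1)  # standard error of the difference at size n
    if se == 0:
        return 1.0 if delta != 0 else 0.0
    z = abs(delta) / se - z_crit
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
```

One would then scan candidate values of `n` and pick the smallest one whose approximate power exceeds a target such as 0.8. Whether this heuristic is statistically sound for F1 is exactly what I am unsure about.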

Google only returns research papers, but I need established standard methods for both the significance test and the sample size estimate (ideally implemented in a Python package).

— EDIT —

Thanks for pointing out this post in the comments – it points in the right direction but unfortunately does not solve my two related problems as stated above.