What is the appropriate test to tell whether, e.g., an F1 score of 0.69 for classifier A and 0.72 for classifier B reflect a true difference rather than chance? (For means, one would use a t-test and obtain a p-value.) Note that I have access to the underlying per-case predictions, not only to the aggregate F1 scores.
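To make the setup concrete, here is the kind of thing I could hack together myself: a paired bootstrap over the test cases, keeping A's and B's predictions paired. I am not sure this is a statistically sound or established procedure; the resampling scheme and the crude two-sided p-value construction are my own assumptions, which is exactly why I am asking for a standard method:

```python
import numpy as np

def f1(y_true, y_pred):
    # binary F1 = 2*TP / (2*TP + FP + FN)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def paired_bootstrap_f1_test(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """Resample test cases with replacement (keeping the two classifiers'
    predictions paired) and inspect the distribution of F1(B) - F1(A)."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = (np.asarray(a) for a in (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # one bootstrap resample of test cases
        diffs[i] = f1(y_true[idx], pred_b[idx]) - f1(y_true[idx], pred_a[idx])
    observed = f1(y_true, pred_b) - f1(y_true, pred_a)
    # crude two-sided p-value: fraction of resampled differences on the
    # "wrong" side of zero, doubled (my own construction, possibly naive)
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

So what I am really asking is whether something like this (or a better-founded alternative, e.g. a permutation or McNemar-style test) exists as an established, packaged method.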
… and how can one estimate the sample size needed in the test set so as not to miss a true difference between the F1 scores (as in the example above)? (For means, one would use a power analysis.) In other words: if I want to know which classifier (A or B) is truly better, to a given significance level, how many test cases do I need?
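For the sample-size question, the only route I could think of is a simulation-based power analysis: fix a hypothetical effect size, simulate test sets of a candidate size, and count how often the difference is detected. Everything here is my own assumption, in particular encoding the effect size as per-case accuracies for A and B (a simplification, since the quantity of interest is F1) and using a percentile bootstrap interval as the detection rule:

```python
import numpy as np

def f1(y_true, y_pred):
    # binary F1 = 2*TP / (2*TP + FP + FN)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def simulated_power(n, acc_a, acc_b, n_sim=200, n_boot=200, alpha=0.05, seed=0):
    """Fraction of simulated test sets of size n on which a percentile
    bootstrap CI for F1(B) - F1(A) excludes zero, i.e. estimated power.
    acc_a / acc_b are assumed per-case accuracies -- a stand-in effect size."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        y = rng.integers(0, 2, n)                       # balanced labels (assumed)
        pa = np.where(rng.random(n) < acc_a, y, 1 - y)  # A correct with prob acc_a
        pb = np.where(rng.random(n) < acc_b, y, 1 - y)  # B correct with prob acc_b
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, n)
            diffs[i] = f1(y[idx], pb[idx]) - f1(y[idx], pa[idx])
        lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
        if lo > 0 or hi < 0:
            rejections += 1
    return rejections / n_sim
```

One would then sweep n over a grid and take the smallest value with estimated power at or above, say, 0.8. Again, I would much prefer a standard, packaged procedure over this homemade simulation.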
Google just returns research papers, but I am looking for some kind of established, standard method for both the sample-size estimation and the significance test (ideally implemented as a Python package).
— EDIT —
Thanks for pointing out this post in the comments. It points in the right direction but unfortunately does not solve the two related problems stated above.