Suppose I have a fixed training data set $D$ and a fixed test data set $F$ and suppose I have an infinite class of models (for example, for simplicity, indexed by a hyperparameter) that can be trained on data.
If I keep training models on $D$ and evaluating their performance on $F$ in order to find better and better models, won’t I “illegally” incorporate knowledge from the test data set into my model? After all, I would effectively be using the test data set to build a model, instead of only using it to evaluate generalization performance.
I have a vague feeling I should not use the test data set “too often” (whatever “too often” might mean).
(To make my somewhat vague question concrete, one could imagine the model class to consist of neural networks for binary classification of, say, flower images, with each neural network having a different architecture. $D$ and $F$ are large sets of labelled images of flowers of type “A” and type “B”, and the loss function is the $\ell_2$ norm.)
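To illustrate the worry, here is a minimal self-contained sketch of the selection loop I have in mind. The names `train_and_score` and the score model are purely hypothetical stand-ins (no real training happens): the "true" quality of hyperparameter $h$ plus noise stands in for the score a trained model would get on the fixed test set $F$.

```python
import random

random.seed(0)

def train_and_score(h, noise):
    """Stand-in for: train a model with hyperparameter h on D,
    then score it once on the fixed test set F.
    The noise term models the finite-sample randomness of F's score."""
    true_quality = -(h - 3) ** 2   # unknown true generalization quality
    return true_quality + noise    # what we actually observe on F

# The loop from the question: repeatedly consult the SAME test set F
# to pick better and better hyperparameters.
best_h, best_score = None, float("-inf")
for h in range(100):
    score = train_and_score(h, random.gauss(0, 5.0))
    if score > best_score:         # the selection decision uses F's score
        best_h, best_score = h, score

# Because taking a max over noisy scores is biased upward, best_score
# tends to overstate the true generalization quality of best_h.
print("selected h:", best_h)
print("observed score on F:", round(best_score, 2))
print("true quality of selected h:", -(best_h - 3) ** 2)
```

The point of the sketch: `best_score` is the maximum of many noisy evaluations on the same $F$, so it is an optimistically biased estimate of how `best_h` actually generalizes, which is exactly the "illegal" information leak the question is about.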