I’ve been looking into semi-supervised learning methods, and have come across the concept of “pseudo-labeling”.
As I understand it, with pseudo-labeling you have a set of labeled data as well as a set of unlabeled data. You first train a model on only the labeled data. You then use that initial model to classify (attach provisional labels to) the unlabeled data. You then feed both the labeled and the pseudo-labeled data back into training, (re-)fitting the model to both the known labels and the predicted labels. (This process can be iterated, re-labeling the unlabeled set with each updated model.)
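As a concrete sketch, the loop I’m describing would look something like this (the two-moons dataset, the logistic-regression model, and the “10 labeled points per class” split are just illustrative assumptions on my part, not part of any fixed recipe):

```python
# Sketch of the pseudo-labeling loop described above. Dataset, model,
# and labeled/unlabeled split are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Pretend only 10 points per class carry known labels.
labeled = np.concatenate([np.where(y == 0)[0][:10],
                          np.where(y == 1)[0][:10]])
X_lab, y_lab = X[labeled], y[labeled]
X_unlab = np.delete(X, labeled, axis=0)

# Step 1: train on the labeled data only.
model = LogisticRegression().fit(X_lab, y_lab)

# Steps 2-3, iterated: pseudo-label the unlabeled pool with the
# current model, then refit on the union of real and pseudo labels.
for _ in range(5):
    pseudo = model.predict(X_unlab)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    model = LogisticRegression().fit(X_all, y_all)

print(model.score(X, y))
```

(scikit-learn ships a ready-made version of this loop as `sklearn.semi_supervised.SelfTrainingClassifier`, if you’d rather not hand-roll it.)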
The claimed benefit is that you can use information about the structure of the unlabeled data to improve the model. A variation of the following figure is often shown, “demonstrating” that the process can produce a more complex decision boundary based on where the (unlabeled) data lie.
However, I’m not quite buying that simplistic explanation. Naively, if training on the labeled data alone produced the upper decision boundary, the pseudo-labels would be assigned according to that boundary: the left-hand side of the upper curve would be pseudo-labeled white, and the right-hand side of the lower curve would be pseudo-labeled black. You wouldn’t get the nice curving decision boundary after retraining, because the new pseudo-labels would simply reinforce the current boundary.
Or to put it another way, the labeled-only decision boundary has perfect prediction accuracy on the pseudo-labeled data (since that boundary is precisely what generated those labels). There’s no driving force (no gradient) that would move the decision boundary simply because we’ve added in the pseudo-labeled data.
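This point is easy to check numerically: by construction, the labeled-only model scores 100% against its own pseudo-labels. (The two-moons data, logistic-regression model, and labeled split below are again just illustrative assumptions.)

```python
# By construction, the labeled-only model agrees perfectly with the
# pseudo-labels it just produced, so refitting sees zero classification
# error on them. (Dataset and model choices are illustrative.)
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
labeled = np.concatenate([np.where(y == 0)[0][:10],
                          np.where(y == 1)[0][:10]])
model = LogisticRegression().fit(X[labeled], y[labeled])

X_unlab = np.delete(X, labeled, axis=0)
pseudo = model.predict(X_unlab)
print(model.score(X_unlab, pseudo))  # 1.0, by definition
```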
Am I correct in thinking that the explanation embodied by the diagram is lacking? Or is there something I’m missing? If I’m not missing anything, what is the benefit of pseudo-labels, given that the pre-retraining decision boundary already has perfect accuracy over them?