I am working on a project where we are coding written survey responses pertaining to a person's mother tongue. For example, if the person wrote in "English" it would get coded to 0001, "Spanish" would get coded to 0002, etc. To do this we have created a reference file that should catch everything we expect to see; for instance, the reference file contains entries like "English" and "Spanish".
The issue is that we have potentially millions of write-in responses that may not match the reference file: spelling mistakes, colloquial terms, and sometimes just nonsense. We would like to use machine learning to process the write-ins that the reference file does not catch. The problem is that we have no "true" values to train on beyond the reference file itself.
We could try using the reference file as a training set, but the performance will likely be poor. We do have experts who can look at a write-in and assign the correct code, so I was wondering if we could build an initial model from the reference file and use active learning to improve it in the following way:
1. Build an initial model on the reference file.
2. Select two samples from the records that do not match the reference file: the first an SRS (simple random sample) from the population, used to measure performance, and the second drawn from records that were particularly difficult for the model to predict (i.e., those with near-equal predicted probabilities across classes).
3. Have our experts code both samples.
4. Calculate performance on the first sample.
5. Retrain the model on the reference file data plus the data from both samples.
6. Repeat steps 2-5 until performance stops increasing significantly.
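To make the idea concrete, here is a minimal sketch of one pass through that loop, assuming scikit-learn, a character n-gram TF-IDF representation, and margin-based uncertainty sampling. The reference entries, write-ins, and expert labels below are all invented for illustration; in practice the reference file has thousands of entries and the samples would go to the human coders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in for the reference file: canonical spellings -> codes (hypothetical).
reference = {"english": "0001", "spanish": "0002", "french": "0003"}

# Step 1: initial model trained only on the reference entries.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(list(reference.keys()), list(reference.values()))

# Unmatched write-ins: misspellings, colloquialisms, noise (invented examples).
writeins = ["engish", "espanol", "franch", "qwerty"]

# Step 2 (uncertainty sample): the smallest margin between the top two
# predicted class probabilities marks the records the model is least sure about.
proba = model.predict_proba(writeins)
top2 = np.sort(proba, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]
hard_idx = np.argsort(margin)[:2]              # send the 2 hardest to experts
hard_cases = [writeins[i] for i in hard_idx]

# Step 3 (simulated): experts assign the correct codes to the sampled records.
expert_labels = {"engish": "0001", "espanol": "0002"}  # hypothetical answers

# Step 5: retrain on the reference entries plus the expert-coded write-ins.
texts = list(reference.keys()) + list(expert_labels.keys())
codes = list(reference.values()) + list(expert_labels.values())
model.fit(texts, codes)
```

The SRS would be scored separately with the retrained model (step 4) to get an unbiased performance estimate, since the uncertainty sample is deliberately non-representative.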
Does this approach sound valid? Am I leaking data somehow by doing this? Is there a better approach?