I’m very confused about how to practically handle datasets for deep learning. If I want to use DL for some task, I (usually) don’t have all possible variations to train the network perfectly. Thus, given some task, one usually starts by searching for a dataset to start with. This dataset will then change over time, because more samples of the original data become available or new, untrained data turns up. To get an answer to my questions, I would like to sketch how I would handle training. This is also heavily related to the question of how to handle training/validation/test sets.
Let’s assume I want to build a network to recognize an animal from a picture. At first I start searching for animal pictures and, with some luck, find a dataset of a few hundred pictures covering exactly three species: adult cats, dogs, and horses.
My first attempt would be to just randomly shuffle all pictures into a 90% training, 5% validation, and 5% test set – all pictures sit in one directory on my hard drive, and I randomly pick 90% of them for training, and so on.
Let’s call this shuffle A(90/5/5).
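For concreteness, a minimal sketch of such a shuffle (assuming the pictures live in one flat directory of `.jpg` files; the names are made up):

```python
import random
from pathlib import Path

def shuffle_split(image_dir, seed=0):
    """Shuffle all pictures and split them 90% / 5% / 5%."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    rng = random.Random(seed)          # fixed seed makes shuffle A reproducible
    rng.shuffle(paths)
    n_train = int(0.90 * len(paths))
    n_val = int(0.05 * len(paths))
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]     # the remaining ~5%
    return train, val, test
```

Changing `seed` would give shuffle B, C, … without duplicating any files on disk.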
The network is trained and its prediction capabilities are evaluated on the 5% validation set. I’m not very impressed, so I change hyperparameters: a deeper network, larger hidden layers, a different learning rate, and so on. For every hyperparameter change I retrain the model on the same 90% training set and evaluate its performance on the same validation set.
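The trial-and-error loop I mean could be sketched like this (`train_model` and `evaluate` are hypothetical stubs standing in for a real framework; only the control flow matters here):

```python
def train_model(train_set, depth, lr):
    return {"depth": depth, "lr": lr}      # stub: would return a trained model

def evaluate(model, dataset):
    # stub: pretend depth 5 with lr 1e-3 works best
    return -abs(model["lr"] - 1e-3) - abs(model["depth"] - 5)

train_set, val_set = [], []                # would hold the 90% / 5% splits
best_score, best = float("-inf"), None
for params in ({"depth": 3, "lr": 1e-3},
               {"depth": 5, "lr": 1e-3},
               {"depth": 5, "lr": 1e-4}):
    model = train_model(train_set, **params)   # retrained from scratch every time
    score = evaluate(model, val_set)           # always the same validation set
    if score > best_score:
        best_score, best = score, model
# only after tuning is done does the test set get touched, once
```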
After a few days, I evaluate the best model from my hyperparameter tuning on the 5% test set for the first time. The result is very bad, so I shuffle again, call the new split B(90/5/5), and start over with the same trial-and-error scheme.
Hopefully, after a few tries, I get good results on the 5% test set, so I’m finished.
A few weeks later my dataset changes. I find a new dataset with two new species; one contains hundreds of pictures, the other only 50. I also add puppies and kittens to the previous dataset. I repeat the whole process again and hope it gets better.
My thoughts and questions on this:
So this process is a big trial-and-error scheme with the huge drawback that I start from zero on every dataset change. Even if I have a well-working network, a small change in the dataset can lead to an insane number of training/tuning repetitions, which may take very long until I find good parameters again.
First, I would build the 90/5/5 sets based on the different species and categories I have. For example, the pictures are split into different directories on my hard drive: cats/adult, cats/kittens, dogs/adult, dogs/puppies, and so on. For each of those directories I would randomly select 90%/5%/5% of the pictures and use the union of the 90% sets for training. The idea is to avoid overfitting to the exact pictures of a species for which I have only a few samples.
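A sketch of this per-category split (assuming a layout like `cats/adult/*.jpg` under one root directory; the directory names are illustrative):

```python
import random
from pathlib import Path

def stratified_split(root_dir, seed=0):
    """Split every category directory 90/5/5 separately, then merge,
    so that even rare categories show up in all three sets."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for category in sorted(p for p in Path(root_dir).glob("*/*") if p.is_dir()):
        paths = sorted(category.glob("*.jpg"))
        rng.shuffle(paths)
        n_train = int(0.90 * len(paths))
        n_val = int(0.05 * len(paths))
        train += paths[:n_train]
        val += paths[n_train:n_train + n_val]
        test += paths[n_train + n_val:]
    return train, val, test
```

With this, a directory of only 50 pictures still contributes a couple of images to validation and test, instead of possibly landing entirely in the training set by chance.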
My second idea would be to reuse the same network and just increase some parameters, instead of starting from scratch or trying a completely different architecture. The idea here is that a bigger dataset needs a bigger network to achieve good results. But I think there is a flaw here: maybe I already overfitted the previous network, in which case it doesn’t make sense to enlarge it?
Do I really need to reshuffle the images again if the model performs poorly on the test set? How can I speed up iteration and maybe recycle knowledge about hyperparameters? How do I know which parameters to change when my dataset changes?
How do big companies handle this problem? How would you handle a speech recognition system whose users generate hundreds of hours of possible data every day, or whose users provide thousands of images to train on?