I want to use an SVM with bootstrap aggregation (bagging) and AdaBoost (boosting). I am implementing these algorithms myself for a custom ML pipeline.
However, I am unsure about the following questions:
- How should I select the number of bags? Intuitively, I would expect the maximum number of bags to be less than the number of instances, but is there a correct or preferred number?
- How should I select the fraction of the training set to put into each bag? Should it match my initial training/testing split, so that if I used 80:20 there, each bag again contains 80% of the training set?
- Should all bags be of exactly equal size?
- For testing each bag’s performance (to set the AdaBoost weights), there are at least three options: (a) test on the instances the bag was trained on, (b) test on the out-of-bag instances, or (c) test on both combined (i.e. the full training set). Which option is correct?
- There is an additional issue specific to SVMs: since SVMs do not perform well on imbalanced data sets, we often balance the classes within the training set. This leaves a lot of ‘unused’ data, at least in biology, where we prefer the test set to reflect the natural class distribution (so not all of the unused data can be added to the test set without distorting that distribution). Could or should bag performance be measured on this unused data?
- Should scaling of values to [0, 1] or [-1, 1] use the min and max within each bag, or within the whole training set (i.e. should the scaling parameters be fitted including out-of-bag examples or not)?
- Is it acceptable to set the bag weights for boosting using a metric other than accuracy, for example the F1 score (to bias performance towards a particular class)?
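To make the bagging part of these questions concrete, here is a minimal sketch of how bags and their out-of-bag (OOB) sets could be drawn. The 25-bag count, the 160-instance training set, and the function names are illustrative assumptions, not a recommendation; it only follows the standard bagging convention of sampling with replacement.

```python
import random

def make_bags(n_instances, n_bags, bag_frac=1.0, seed=0):
    """Draw bootstrap bags (sampling WITH replacement) and record each
    bag's out-of-bag (OOB) indices. bag_frac=1.0 is the standard bagging
    convention: each bag has as many draws as the training set itself."""
    rng = random.Random(seed)
    bag_size = int(round(bag_frac * n_instances))
    bags = []
    for _ in range(n_bags):
        in_bag = [rng.randrange(n_instances) for _ in range(bag_size)]
        oob = sorted(set(range(n_instances)) - set(in_bag))
        bags.append((in_bag, oob))
    return bags

# Illustrative numbers only: 160 training instances, 25 bags.
bags = make_bags(n_instances=160, n_bags=25)
in_bag, oob = bags[0]
# With replacement, a bag of n draws leaves roughly (1 - 1/n)**n ≈ 36.8%
# of the instances out of bag, available for unbiased evaluation.
print(len(in_bag), len(oob))
```

Note that with this scheme every bag has the same number of draws, but the OOB set size varies slightly from bag to bag, which bears on the equal-size question above.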
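For the scaling and F1-weighting questions, one possible arrangement (an assumption, not the textbook AdaBoost update, which uses the weighted error rate) is to fit the [0, 1] scaling parameters on the in-bag instances only, and to substitute F1 for accuracy in the usual AdaBoost weight formula. The helper names and toy rows below are hypothetical.

```python
import math

def minmax_fit(rows):
    """Fit [0, 1] scaling parameters on in-bag rows only, so OOB
    evaluation sees each bag exactly as deployment would."""
    cols = list(zip(*rows))
    return [(min(c), max(c)) for c in cols]

def minmax_apply(rows, params):
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, params)]
            for row in rows]

def f1_weight(f1):
    """AdaBoost-style voting weight with F1 substituted for accuracy:
    alpha = 0.5 * ln(F1 / (1 - F1)). This substitution is an assumption."""
    f1 = min(max(f1, 1e-9), 1 - 1e-9)  # avoid log(0) / division by zero
    return 0.5 * math.log(f1 / (1 - f1))

in_bag_rows = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]  # toy data
oob_rows = [[4.0, 0.0]]
params = minmax_fit(in_bag_rows)
print(minmax_apply(oob_rows, params))  # OOB values can fall outside [0, 1]
print(round(f1_weight(0.8), 3))
```

A side effect worth noticing: if the parameters are fitted per bag, out-of-bag values can land outside [0, 1] (as in the first print), which is one practical argument in the per-bag-versus-whole-set trade-off.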
For example, suppose I have a data set of 200 instances, split 80:20 training:testing, i.e. 160:40. Should I fill each bag with 80% of the training set again, i.e. 128 instances? And if I do choose 128 instances per bag, how many bags should I use?
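The arithmetic in this example can be checked directly. The sketch below (seed and sizes are just for illustration) also shows that under standard bagging, where each bag makes 160 draws with replacement, a bag contains only about 63% unique instances anyway, so the 80% (128/160) intuition is not far from what the standard scheme effectively gives you.

```python
import random

n_total, train_frac = 200, 0.8
n_train = int(train_frac * n_total)      # 160
n_test = n_total - n_train               # 40
bag_size_80pct = int(0.8 * n_train)      # 128, the size asked about

# Standard bagging: each bag makes n_train draws WITH replacement.
rng = random.Random(0)
bag = [rng.randrange(n_train) for _ in range(n_train)]
n_unique = len(set(bag))

# In expectation a bootstrap bag of n draws contains
# n * (1 - (1 - 1/n)**n) ≈ 63.2% unique instances.
print(n_train, n_test, bag_size_80pct, n_unique)
```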