#StackBounty: #bootstrap #aggregation #bagging #out-of-sample #adaboost SVM ensemble methods: bootstrap aggregation & boosting, how…

Bounty: 50

I want to use an SVM with bootstrap aggregation (bagging) and AdaBoost (or boosting in general). I am implementing these algorithms myself for a custom ML pipeline.

However, I am unsure about the following questions:

  1. How should I select the number of bags to use? Logically, I know that the maximum number of bags should be less than the number of instances. However, is there a correct or preferred number?
  2. How should I select the percentage of the training set to put into each bag? Should the percentage be the same as the initial training/testing split, so that if I split 80:20, I again use 80:20 for the bags?
  3. Should all bags be exactly the same size?
  4. To test each bag’s performance (for setting the AdaBoost weights), there are at least three options: (a) test on the instances that the bag was trained with, (b) test on the out-of-bag instances, or (c) test on both combined (i.e. the full training set). Which option is correct?
  5. There is an additional issue for SVM specifically: since SVM doesn’t perform well on imbalanced data sets, we often balance the classes within training sets. This leaves a lot of ‘unused’ data (at least in biology – since in biology, we prefer to represent the natural distribution of data [meaning not all unused data can form part of the testing set, as it would unbalance it from its natural distribution]). Could/should bag performance be measured with this unused data?
  6. Should scaling of values to [0, 1] or [-1, 1] be done considering the min and max of the values within each bag, or within the whole training set (i.e. should scaling parameters be set including out-of-bag examples or not)?
  7. Is it acceptable to set the bag weights for boosting using a metric other than accuracy, for example the F1 score (to bias performance towards a particular class)?
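
To make questions 4 and 6 concrete, here is a minimal sketch of what I mean by "in-bag" and "out-of-bag" instances and by fitting the scaling parameters inside a bag. It is only an illustration, not a claim about which option is right. The helper names (`draw_bag`, `fit_minmax`, `apply_minmax`) and the toy data are my own assumptions, and the bag is drawn with replacement, as in standard bagging.

```python
import random

def draw_bag(n_train, bag_size, rng):
    """Sample bag_size indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_train) for _ in range(bag_size)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_train) if i not in chosen]
    return in_bag, out_of_bag

def fit_minmax(rows):
    """Per-feature (min, max), computed from the given rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_minmax(rows, params):
    """Scale with previously fitted (min, max) pairs; constant features map to 0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, params)]
            for row in rows]

# Toy training set: 160 instances, 2 features (hypothetical data).
rng = random.Random(0)
X = [[rng.uniform(0, 10), rng.uniform(-5, 5)] for _ in range(160)]

in_bag, oob = draw_bag(len(X), 128, rng)

# Question 6, first variant: scaling parameters come from the bag alone.
params = fit_minmax([X[i] for i in in_bag])
X_bag = apply_minmax([X[i] for i in in_bag], params)  # always inside [0, 1]
X_oob = apply_minmax([X[i] for i in oob], params)     # may fall outside [0, 1]
```

With this setup, option (a) in question 4 would evaluate on `X_bag`, option (b) on `X_oob`, and option (c) on both together. The second variant in question 6 would instead call `fit_minmax(X)` on the whole training set.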

For example, suppose I have a data set of 200 instances, split 80:20 training:testing, meaning 160:40. Should I fill each bag with 80% again (128 instances)? If I choose 128 instances, how many bags should I use?
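
The arithmetic behind this example can be sketched as follows. One assumption on top of the numbers above: if a bag is drawn with replacement, some instances repeat, so the number of distinct instances in a bag is smaller than the bag size. The standard expectation for drawing m times from n instances is n(1 − (1 − 1/n)^m). The function name `expected_distinct` is mine.

```python
def expected_distinct(n, m):
    """Expected number of distinct instances when drawing m times,
    with replacement, from n instances."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

n_total = 200
n_train = round(n_total * 0.8)   # 160 training instances
n_test = n_total - n_train       # 40 testing instances
bag_size = round(n_train * 0.8)  # 128 instances per bag

# Distinct training instances expected in one bag, for two bag sizes.
distinct_128 = expected_distinct(n_train, bag_size)
distinct_160 = expected_distinct(n_train, n_train)

print(n_train, n_test, bag_size)
print(round(distinct_128, 1), round(distinct_160, 1))
```

So a bag of 128 drawn with replacement would be expected to leave more of the 160 training instances out-of-bag than a bag of 160 would. If bags are drawn without replacement, every bag contains exactly 128 distinct instances and this formula does not apply.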
