Variable importance in a random forest is usually calculated in one of two ways, and the permutation-based measure is generally considered the better of the two. In R’s
randomForest package, the permutation-based measure is reported as
%IncMSE (percentage increase in mean squared error) for regression. The calculation is explained clearly by @SorenHavelundWelling in this answer:
%IncMSE is the most robust and informative measure. It is the increase
in mse of predictions(estimated with out-of-bag-CV) as a result of
variable j being permuted(values randomly shuffled).
- grow regression forest. Compute OOB-mse, name this mse0.
- for 1 to j var: permute values of column j, then predict and compute OOB-mse(j)
- %IncMSE of j’th is (mse(j)-mse0)/mse0 * 100%
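To make the three steps above concrete, here is a minimal Python sketch of the same arithmetic. It is not the randomForest code: `toy_predict` is a hypothetical stand-in for a fitted forest, the data are simulated, and the error is computed on the full sample rather than on out-of-bag rows.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error of the predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pct_inc_mse(predict, X, y, j, rng):
    """(mse(j) - mse0) / mse0 * 100 after shuffling column j."""
    mse0 = mse(y, [predict(row) for row in X])
    col = [row[j] for row in X]
    rng.shuffle(col)  # permute the values of column j only
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    mse_j = mse(y, [predict(row) for row in X_perm])
    return (mse_j - mse0) / mse0 * 100.0

# Simulated data (assumption): y depends on column 0 only, column 1 is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * x1 + rng.gauss(0, 0.5) for x1, _ in X]

def toy_predict(row):
    """Hypothetical 'model' that has learned the true relationship."""
    return 3.0 * row[0]

inc = [pct_inc_mse(toy_predict, X, y, j, rng) for j in range(2)]
print(inc)  # large positive value for column 0, exactly 0.0 for column 1
```

Under this definition the result reads directly as a percentage change relative to the baseline error `mse0`.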
The quoted calculation matches my prior understanding as well. However, the package documentation describes it differently (emphasis mine):
The first measure [%IncMSE] is computed from permuting OOB data: For each tree,
the prediction error on the out-of-bag portion of the data is recorded
(error rate for classification, MSE for regression). Then the same is
done after permuting each predictor variable. The difference between
the two are then averaged over all trees, and normalized by the
standard deviation of the differences. If the standard deviation of
the differences is equal to 0 for a variable, the division is not done
(but the average is almost always equal to 0 in that case).
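Read literally, the documentation text above describes something like the following Python sketch. The per-tree MSE values are made up for illustration, `scaled_importance` is a hypothetical helper, and the use of the sample standard deviation is my assumption about what "standard deviation of the differences" means.

```python
from statistics import mean, stdev

def scaled_importance(mse_before, mse_after):
    """Mean per-tree increase in OOB MSE, divided by the SD of the increases."""
    diffs = [after - before for before, after in zip(mse_before, mse_after)]
    avg = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (assumption)
    # As in the documentation: skip the division when the SD is 0.
    return avg if sd == 0 else avg / sd

# Made-up per-tree OOB MSE before and after permuting one predictor.
before = [1.0, 1.2, 0.9, 1.1]
after = [1.5, 1.6, 1.5, 1.4]

scaled = scaled_importance(before, after)
print(scaled)  # about 3.49

# Changing the units of the response multiplies every MSE by a constant,
# but leaves the scaled value unchanged.
rescaled = scaled_importance([100 * m for m in before], [100 * m for m in after])
print(rescaled)
```

In this sketch the baseline error never appears in a denominator, so the output behaves like a unitless, z-score-like quantity rather than a percentage of `mse0`.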
The normalization step in the documentation does not make sense to me, and the answer quoted above omits it. Why is it part of the calculation? And does the resulting value still mean a percentage change after normalization? It does not seem so to me.