*Bounty: 50*

Random forests have their variable importance calculated using one of two methods, of which permutation-based importance is considered better. In R's `randomForest` package, this returns a measure called `%IncMSE` (percentage increase in mean squared error) for regression cases. The calculation is explained clearly by @SorenHavelundWelling in this answer:

> %IncMSE is the most robust and informative measure. It is the increase in MSE of predictions (estimated with out-of-bag CV) as a result of variable j being permuted (values randomly shuffled).
>
> - grow regression forest. Compute OOB-MSE, name this mse0.
> - for 1 to j var: permute values of column j, then predict and compute OOB-MSE(j)
> - %IncMSE of j'th is (mse(j)-mse0)/mse0 * 100%
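The quoted steps can be sketched numerically. This is only an illustration of the %IncMSE arithmetic: a fixed linear rule stands in for an actual OOB forest fit (so there is no real out-of-bag machinery here), and all names and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends strongly on column 0, weakly on column 1.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in predictor replacing an OOB forest prediction; the
# permutation-importance arithmetic is the same regardless of model.
def predict(X):
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

# Step 1: baseline error, playing the role of OOB-MSE ("mse0").
mse0 = np.mean((y - predict(X)) ** 2)

# Steps 2-3: permute each column in turn and compute the relative
# increase in MSE, expressed as a percentage.
pct_inc_mse = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])      # shuffle column j only
    mse_j = np.mean((y - predict(Xp)) ** 2)
    pct_inc_mse.append((mse_j - mse0) / mse0 * 100)

print(pct_inc_mse)
```

As expected, shuffling the strong predictor (column 0) inflates the MSE far more than shuffling the weak one, and the resulting numbers read directly as percentage increases over the baseline error.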

This matches my prior understanding as well. But the package documentation explains things this way (emphasis mine):

> The first measure [%IncMSE] is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and **normalized by the standard deviation of the differences**. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
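To make the normalization step concrete, here is a small sketch using hypothetical per-tree MSE differences (not output from a real forest). It computes mean(differences)/sd(differences) as the documentation describes, and checks one consequence: the ratio is invariant to rescaling the differences, which is one way of seeing that the normalized value is unitless rather than a percentage of anything.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-tree increases in OOB MSE for one predictor
# (one value per tree), standing in for a fitted forest's output.
diffs = rng.normal(loc=2.0, scale=0.5, size=500)

raw_mean = diffs.mean()               # average increase, in MSE units
importance = raw_mean / diffs.std()   # the normalized value

# Rescaling y (and hence every MSE difference) by a constant rescales
# the mean and the standard deviation alike, so the normalized value
# is unchanged -- it carries no units.
scaled = diffs * 100.0
assert np.isclose(scaled.mean() / scaled.std(), importance)

print(raw_mean, importance)
```

The raw mean still has the units of an MSE difference, while the normalized value behaves like a t-statistic over trees; neither is a percentage of the baseline error, which is the discrepancy the question is asking about.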

This normalization step does not really make sense to me, and is omitted by the answer quoted above as well. Why is this part of the calculation? And does the resulting value retain the meaning of a percentage change after this normalization? It wouldn’t seem like it to me.