I’m trying to understand a nice blog post on the trade-off between sensitivity and specificity with random forest and logistic regression models. I have a few questions:
1) The blog used 10-fold cross-validation with the `ranger` package in R (see the model `mod_rf`) and set the metric to ROC. So, is the final output (the confusion matrix) we get the one for the model with the best ROC (AUC value) among the 10 validation sets?
2) When I try to see the variable importance with `varImp(mod_rf)`, it says the importance values are not available. Why is that, and how can I get them?
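For what it’s worth, I gather that `ranger` only computes importance when asked to (via its `importance` argument), so I’m guessing something like the following would be needed, though I’m not sure it matches the blog’s code:

```r
library(caret)

# Assumes the same placeholder `dat`/`ctrl` as above;
# the extra importance argument is passed through to ranger()
mod_rf <- train(Sex ~ ., data = dat,
                method = "ranger",
                metric = "ROC",
                importance = "permutation",  # or "impurity"
                trControl = ctrl)

varImp(mod_rf)  # should now return importance scores
```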
3) The `caret` package in R allows upsampling to adjust for an imbalance in the data. The blog tried logistic regression (see the model `sim_glm`) with upsampling and specified `repeats = 2` to repeat the 10-fold cross-validation twice. How does this work? I’m not clear on it. Does it upsample females to create a 50-50 ratio of males and females “before” each fold of the cross-validation? And how would upsampling, `repeats = 2`, and 10-fold cross-validation work together in the case of a random forest?
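My mental model of the upsampling setup is something like the sketch below (again, the data frame and formula are my placeholders, not the blog’s code):

```r
library(caret)

# repeatedcv with repeats = 2 gives 2 x 10 = 20 resamples;
# sampling = "up" (my guess) upsamples the minority class
# within each training partition, not the held-out fold
ctrl_up <- trainControl(method = "repeatedcv", number = 10, repeats = 2,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        sampling = "up")

# sim_glm is the blog's model name; family = "binomial" for logistic regression
sim_glm <- train(Sex ~ ., data = dat,
                 method = "glm", family = "binomial",
                 metric = "ROC",
                 trControl = ctrl_up)
```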
4) If the AUC (from the ROC curve) on my training data is about 10 percentage points lower than the AUC on the test data, how should I explain that? (This happened with my data.) I thought the training data would always show a higher AUC than the test data, because we used the training data to build the model.
I appreciate your responses.