*Bounty: 50*

I'm exploring some ML strategies using the `caret` package. My goal is to select the best predictors and to obtain an optimal model for further predictions. My dataset is:

- 75 observations (39 S and 36 F for the dependent variable, named `group`), so the dataset is well balanced
- 13 independent variables (predictors) with continuous values from 0 to 1, without any NAs, named:
A_1, A_2, A_3, A_4, A_5, B_1, B_2, C_1, C_2, C_3, D_1, D_2, E_1

Moreover, the values of each predictor differ significantly between F and S (Wilcoxon rank-sum test).
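For reference, the per-predictor Wilcoxon comparison can be scripted in a loop. This is a minimal sketch using simulated stand-in data (the real `data` object, the `S`/`F` group levels, and the predictor names follow the question; only three predictors are simulated here):

```r
# Sketch: Wilcoxon rank-sum test of each predictor between the two groups.
# Simulated stand-in for the real data (75 obs, values in [0, 1]).
set.seed(1)
data <- data.frame(
  group = factor(rep(c("S", "F"), times = c(39, 36))),
  A_1 = runif(75), A_5 = runif(75), E_1 = runif(75)
)

predictors <- setdiff(names(data), "group")
pvals <- sapply(predictors, function(v) {
  wilcox.test(data[[v]] ~ data$group)$p.value
})
print(round(pvals, 3))
```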

I started with the train/test split and repeated 10-fold cross-validation:

```
library(caret)

set.seed(355)
trainIndex <- createDataPartition(data$group, p = 0.7, list = FALSE)
trainingSet <- data[trainIndex, ]
testSet <- data[-trainIndex, ]
methodCtrl <- trainControl(
  method = "repeatedcv",    # 10-fold CV repeated 5 times
  number = 10,
  repeats = 5,
  savePredictions = "final",
  classProbs = TRUE,        # needed for ROC-based summary
  summaryFunction = twoClassSummary
)
```

Then, based on several articles and tutorials, I selected six ML methods and trained models with all predictor variables:

```
rff <- train(group ~., data = trainingSet, method = "rf", metric = "ROC", trControl = methodCtrl)
nbb <- train(group ~., data = trainingSet, method = "nb", metric = "ROC", trControl = methodCtrl)
glmm <- train(group ~., data = trainingSet, method = "glm", metric = "ROC", trControl = methodCtrl)
nnett <- train(group ~., data = trainingSet, method = "nnet", metric = "ROC", trControl = methodCtrl)
glmnett <- train(group ~., data = trainingSet, method = "glmnet", metric = "ROC", trControl = methodCtrl)
svmRadiall <- train(group ~., data = trainingSet, method = "svmRadial", metric = "ROC", trControl = methodCtrl)
```
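Before looking at test-set accuracy, the cross-validated ROC of the six fits can also be compared directly with caret's `resamples()`; a sketch, assuming the six `train` objects from above exist:

```r
# Sketch: compare cross-validated ROC across the six fitted models.
# Assumes rff, nbb, glmm, nnett, glmnett, svmRadiall from above.
resamps <- resamples(list(
  rf = rff, nb = nbb, glm = glmm,
  nnet = nnett, glmnet = glmnett, svmRadial = svmRadiall
))
summary(resamps)                  # mean/median ROC, Sens, Spec per model
bwplot(resamps, metric = "ROC")   # boxplots of the resampled ROC values
```

This uses the resampling results already stored in each `train` object, so no refitting is needed.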

How accurate are the models?

```
fitted <- predict(rff, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 61  # 61
fitted <- predict(nbb, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 66  # 66
fitted <- predict(glmm, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 57  # 66
fitted <- predict(nnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 42  # 66
fitted <- predict(glmnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 61  # 57
fitted <- predict(svmRadiall, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 66  # 66
```

After the first `#` I put the prediction accuracy (%) of each model. I also drew a simple ROC comparison of all models:
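For the ROC comparison itself, the `pROC` package can overlay test-set curves; a sketch assuming the fitted models and `testSet` from above (only two of the six models are shown to keep it short):

```r
# Sketch: overlay test-set ROC curves for two of the models (pROC package).
library(pROC)

probs_rf  <- predict(rff,  testSet, type = "prob")[, "F"]
probs_glm <- predict(glmm, testSet, type = "prob")[, "F"]

roc_rf  <- roc(response = testSet$group, predictor = probs_rf,  levels = c("S", "F"))
roc_glm <- roc(response = testSet$group, predictor = probs_glm, levels = c("S", "F"))

plot(roc_rf, col = "blue")
plot(roc_glm, col = "red", add = TRUE)
legend("bottomright",
       legend = c(sprintf("rf  AUC = %.2f", auc(roc_rf)),
                  sprintf("glm AUC = %.2f", auc(roc_glm))),
       col = c("blue", "red"), lwd = 2)
```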

Now I'd like to improve my model, so I used `glmStepAIC` to keep only the best (most important) predictors. Here's what I got:

```
aic <- train(group ~ ., data = trainingSet, method = "glmStepAIC",
             trControl = methodCtrl, metric = "ROC", trace = FALSE)
summary(aic$finalModel)

# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -2.0191  -0.6077   0.3584   0.6991   2.5416
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  0.39809    1.01733   0.391  0.69557
# A_5          0.11726    0.04701   2.494  0.01263 *
# C_2          0.17789    0.11084   1.605  0.10852
# C_3         -0.18231    0.11027  -1.653  0.09828 .
# E_1         -0.14176    0.05260  -2.695  0.00704 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 74.786  on 53  degrees of freedom
# Residual deviance: 48.326  on 49  degrees of freedom
# AIC: 58.326
#
# Number of Fisher Scoring iterations: 5
```

Based on this result, I chose these 4 predictor variables:

```
data <- read.table("data.txt", sep = '\t', header = TRUE, dec = ',')
data <- data[, c('group', 'A_5', 'C_2', 'C_3', 'E_1')]
```
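One caveat worth flagging here: selecting predictors on the training set and then re-splitting the same 75 rows lets selection bias leak into the evaluation. caret can instead perform the selection inside the resampling loop, e.g. via recursive feature elimination; this is a sketch, not a drop-in answer, assuming the full 13-predictor `data` (before the subsetting above):

```r
# Sketch: recursive feature elimination with selection done inside CV.
# lrFuncs uses a GLM (logistic regression) as the underlying classifier.
rfeCtrl <- rfeControl(functions = lrFuncs, method = "repeatedcv",
                      number = 10, repeats = 5)
set.seed(355)
rfeFit <- rfe(x = data[, setdiff(names(data), "group")],
              y = data$group,
              sizes = c(2, 4, 6, 8, 10),   # candidate subset sizes
              rfeControl = rfeCtrl)
rfeFit              # best subset size and its resampled performance
predictors(rfeFit)  # names of the selected predictors
```

Because the elimination is repeated within every resample, the reported performance already accounts for the selection step.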

And I repeated everything (train/test split, model fitting, model testing, etc.) **only with these 4 predictors instead of all 13**. Unfortunately the accuracy is still low; see the % after the second `#` in the `confusionMatrix` part. Moreover, the ROC comparison is even worse:

I'm new to this kind of analysis, so could you please tell me whether I'm making some serious mistake, or whether my dataset is too small / my data is rubbish?

How can I choose optimal predictors to get the best model? Which ML methods should I pick?

Best Regards,

Adam