#StackBounty: #classification #cross-validation #scikit-learn #hyperparameter #ensemble Should I perform nested CV with Grid Search to …

Bounty: 50

I’m doing classification of 8 types of hand gestures with stacked models. For that I initially split the data into training and test sets. Then I used GridSearchCV to tune the hyperparameters.

Here’s the code:

import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = [

    {
        # Random forest
        'bootstrap': [True, False],
        'max_depth': [40, 50, 60, 70, 80],
        #'max_features': [2, 3],
        'min_samples_leaf': [3, 4, 5],
        'min_samples_split': [8, 10, 12],
        'n_estimators': [10, 15, 20, 25],
        'criterion': ['gini', 'entropy'],
        'random_state': [45]
    },

    {
        # K nearest neighbours
        'n_neighbors': [5, 6, 7, 9, 11],
        'leaf_size': [1, 3, 5, 7],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'metric': ['euclidean', 'manhattan']
    },

    {
        # SVM
        'C': list(np.arange(1, 5, 0.01)),
        'gamma': ['scale', 'auto'],
        'kernel': ['rbf', 'poly', 'sigmoid', 'linear'],
        'decision_function_shape': ['ovo', 'ovr'],
        'random_state': [45]
    }
]

models_to_train = [RandomForestClassifier(), KNeighborsClassifier(), svm.SVC()]

final_models = []
for i, model in enumerate(models_to_train):
    params = param_grid[i]
    
    clf = GridSearchCV(estimator=model, param_grid=params, cv=20,
                       scoring='accuracy').fit(data_train, label_train)
    final_models.append(clf.best_estimator_)

Now, I built a stacking classifier from the best estimators output by GridSearchCV, trained it on the training data, and evaluated it on the test data:

estimators = [
    ('rf', final_models[0]),
    ('knn', final_models[1])                 
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=final_models[2]
)

category_predicted = clf.fit(data_train, label_train).predict(data_test)
acc = accuracy_score(label_test, category_predicted) * 100
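One way to probe how sensitive that single accuracy number is to the split is to refit and score the stack over several random train/test splits and look at the spread. Below is a minimal sketch of that idea; it uses synthetic data from `make_classification` in place of the real gesture features, and rebuilds the base models with default hyperparameters, so the numbers themselves are only illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 8-class gesture data
X, y = make_classification(n_samples=400, n_classes=8, n_informative=10,
                           random_state=45)

accs = []
for seed in range(5):
    # A different stratified split each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = StackingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=45)),
                    ('knn', KNeighborsClassifier())],
        final_estimator=svm.SVC(random_state=45),
    )
    accs.append(accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te)))

# A large standard deviation suggests the score depends heavily on the split
print(np.mean(accs), np.std(accs))
```

If the standard deviation is small relative to the mean, one lucky or unlucky split is less of a worry.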

My doubt is:

I performed the train-test split at the beginning and didn’t use nested CV because I thought it would increase the running time a lot, since I’m using an ensemble model. The model produced very good accuracy, over 95%. Is there a high chance that the model gives much lower accuracy if the train-test split changes? So, should I stop doing the train-test split at the beginning and instead perform nested CV with grid search on the entire data (like what is described here)?
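For reference, nested CV for this setup can be sketched as follows: the inner `GridSearchCV` tunes each model on whatever data it is fitted to, and the outer loop scores the whole tuning-plus-stacking procedure. This is only a runnable sketch, with synthetic data and deliberately shrunken grids and fold counts so it finishes quickly; the real data and the full `param_grid` above would slot in the same way.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 8-class gesture data
X, y = make_classification(n_samples=400, n_classes=8, n_informative=10,
                           random_state=45)

# Trimmed-down grids so the example runs fast
small_grids = [
    {'n_estimators': [10, 25], 'max_depth': [40, 80], 'random_state': [45]},
    {'n_neighbors': [5, 7], 'metric': ['euclidean', 'manhattan']},
    {'C': [1.0, 2.0], 'kernel': ['rbf', 'linear'], 'random_state': [45]},
]
models = [RandomForestClassifier(), KNeighborsClassifier(), svm.SVC()]

# Inner CV: each base model (and the meta-model) is tuned on the data
# it is fitted to inside each outer training fold
tuned = [GridSearchCV(m, g, cv=3, scoring='accuracy')
         for m, g in zip(models, small_grids)]
stack = StackingClassifier(
    estimators=[('rf', tuned[0]), ('knn', tuned[1])],
    final_estimator=tuned[2],
)

# Outer CV: every fold re-runs the tuning from scratch, so the scores
# estimate the performance of the full procedure, not of one fixed split
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=45)
scores = cross_val_score(stack, X, y, cv=outer, scoring='accuracy')
print(scores.mean(), scores.std())
```

With the full grids this is expensive (every outer fold repeats the entire grid search), which is exactly the time-complexity trade-off the question is about.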


