#StackBounty: #machine-learning #predictive-modeling #logistic-regression #svm #naive-bayes-classifier How to improve results in classi…

Bounty: 50

I am new to machine learning and model building, but a lot of tutorials have helped me learn more about the topic.
I am trying to build a predictive model for detecting fake news.
The number of samples with labels 0 and 1 is as follows:

0    2015
1     798

It is not well balanced, unfortunately, as you can see.
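To put a number on the imbalance, the counts above can be turned into proportions, together with the accuracy a trivial classifier would reach by always predicting the more frequent label. This is a minimal sketch that uses only the two counts quoted above; the variable names are illustrative.

```python
# Label counts quoted in the post.
counts = {0: 2015, 1: 798}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent label.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"label {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any accuracy figure for this data is easier to judge when read against that baseline rather than against 50%.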
I split the dataset as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)

i.e. 70% train and 30% test.
I hope this split makes sense even though the classes are unbalanced.
Then, after cleaning the text by removing stopwords and punctuation (should I have done something else?), I ran different models, specifically Multinomial Naive Bayes, SVM and Logistic Regression, getting the following results:
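For what it is worth, `stratify=y` is the part of that call that addresses the imbalance at split time: it keeps the class ratio roughly the same in the train and test parts. A small sketch to check this, assuming synthetic labels with the same counts as above and placeholder features (the `random_state` is an added assumption, used only for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as in the post (2015 zeros, 798 ones).
y = np.array([0] * 2015 + [1] * 798)
# Placeholder features: a single dummy column, only the labels matter here.
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0  # random_state is an assumption
)

# With stratify=y, the share of label 1 should be nearly identical in each part.
for name, part in [("all", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: n={len(part):4d}, share of label 1 = {part.mean():.3f}")
```

A stratified split does not rebalance the classes; it only makes sure the test set reflects the same imbalance as the training set.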

MNB : 84%

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       476
           1       0.45      0.40      0.42        95

    accuracy                           0.82       571
   macro avg       0.66      0.65      0.66       571
weighted avg       0.81      0.82      0.81       571
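The summary rows of this report follow from the two per-class rows, which is a useful way to see why the overall numbers look much better than the class 1 numbers. A minimal sketch that recomputes them from the rounded values printed above (so small rounding differences are expected):

```python
# Per-class values copied from the report above: (precision, recall, f1, support).
report = {
    0: (0.88, 0.90, 0.89, 476),
    1: (0.45, 0.40, 0.42, 95),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n = sum(s for _, _, _, s in report.values())

# Accuracy equals the support-weighted average of the per-class recalls.
accuracy = sum(r * s for _, r, _, s in report.values()) / n
macro_f1 = sum(f for _, _, f, _ in report.values()) / len(report)
weighted_f1 = sum(f * s for _, _, f, s in report.values()) / n

print(f"F1 for class 1: {f1(0.45, 0.40):.3f}")
print(f"accuracy:       {accuracy:.3f}")
print(f"macro F1:       {macro_f1:.3f}")
print(f"weighted F1:    {weighted_f1:.3f}")
print(f"always-0 accuracy on this test set: {476 / n:.3f}")
```

Because class 0 makes up 476 of the 571 test samples, accuracy and the weighted average are dominated by class 0, while the macro average weights both classes equally.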

SVM: Accuracy: 0.8336252189141856

Precision: 0.5
Recall: 0.2736842105263158 (Terrible results!)
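The three SVM numbers are enough to reconstruct the confusion matrix, which shows more directly what the model is doing. This sketch assumes the SVM was evaluated on the same 571-sample test set as the report above (476 samples of class 0, 95 of class 1); that is an assumption, not something stated in the post.

```python
# Reported SVM metrics, plus class supports taken from the MNB report
# (assumes the same 571-sample test set was used for both models).
accuracy, precision, recall = 0.8336252189141856, 0.5, 0.2736842105263158
n_neg, n_pos = 476, 95
n = n_neg + n_pos

tp = round(recall * n_pos)          # true positives
fn = n_pos - tp                     # false negatives
fp = round(tp / precision) - tp     # false positives
tn = round(accuracy * n) - tp       # true negatives

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
# Consistency check: TN + FP should equal the number of class 0 samples.
print("consistent with support:", tn + fp == n_neg)
print(f"always-0 baseline accuracy: {n_neg / n:.4f}")
```

Under that assumption, the number of correct predictions (TP + TN) equals the number of class 0 samples, so the reported accuracy is the same as what always predicting 0 would give on this test set.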

Logistic regression: 0.8546409807355516

All the tutorials show that the steps for building a good model on text data are removing stopwords, punctuation and extra words.
I have done all of these things, but there is probably something more I could do to improve the results.
I have read that results above 99% usually point to problems such as overfitting; still, I would really like to reach at least 92%.
What do you think? How could I further improve the models? Do you think the unbalanced classes could have affected the results?
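For reference, the workflow described above could be sketched roughly as follows, with one common way of accounting for class imbalance added (`class_weight="balanced"`). Everything specific here is an assumption: the toy corpus is hypothetical, TF-IDF is assumed as the vectorizer, and `LinearSVC` stands in for "SVM" since the post does not say which variant was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real news data (label 1 = fake).
texts = (
    ["officials confirm budget report", "council approves new road plan",
     "study finds steady job growth", "court publishes annual ruling data",
     "weather service issues rain forecast", "school board sets exam dates"] * 5
    + ["shocking miracle cure doctors hide", "secret plot exposed click now"] * 5
)
labels = [0] * 30 + [1] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=0
)

models = {
    "MNB": MultinomialNB(),
    # class_weight="balanced" reweights errors inversely to class frequency.
    "SVM": LinearSVC(class_weight="balanced"),
    "LogReg": LogisticRegression(class_weight="balanced", max_iter=1000),
}

predictions = {}
for name, clf in models.items():
    # TF-IDF with English stopword removal; the default tokenizer drops punctuation.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    pipe.fit(X_train, y_train)
    predictions[name] = pipe.predict(X_test)
    print(name)
    print(classification_report(y_test, predictions[name], zero_division=0))
```

The toy corpus is only there to make the sketch run; whether reweighting helps on the real data would have to be checked there, ideally by comparing per-class recall and F1 rather than accuracy alone.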

Any suggestions would be greatly appreciated.
