I am new to Machine Learning and model building, but a lot of tutorials have given me the chance to learn more about this topic.
I am trying to build a predictive model for detecting fake news.
The number of samples with labels 0 and 1 is the following:
T
0    2015
1     798
It is not well balanced, unfortunately, as you can see.
I split the dataset as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify=y)
i.e. 70% train and 30% test.
I hope it makes sense, though I have unbalanced classes.
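To check that the stratified split keeps the class ratio in both parts, a minimal sketch along these lines may help. It assumes only the label counts given above (2015 and 798); the feature array is a placeholder and `random_state=42` is an arbitrary choice added for reproducibility, not part of the original call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts in the question; X is a placeholder
# (hypothetical), since only the label proportions matter for this check.
y = np.array([0] * 2015 + [1] * 798)
X = np.arange(len(y)).reshape(-1, 1)

# Same call as in the question, plus a fixed seed (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# With 0/1 labels, the mean is the fraction of class 1.
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"class 1 share, full:  {full_ratio:.3f}")
print(f"class 1 share, train: {train_ratio:.3f}")
print(f"class 1 share, test:  {test_ratio:.3f}")
```

If the three printed shares are nearly identical, the split itself is not distorting the class balance; the imbalance is simply carried over into both parts.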
Then, after cleaning the text by removing stopwords and punctuation (should I have done something else?), I ran different models, specifically Multinomial Naive Bayes, SVM and Logistic Regression, getting the following results:
MNB: 84%

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       476
           1       0.45      0.40      0.42        95

    accuracy                           0.82       571
   macro avg       0.66      0.65      0.66       571
weighted avg       0.81      0.82      0.81       571
SVM: Accuracy: 0.8336252189141856
Recall: 0.2736842105263158 (Terrible results!)
Logistic regression: 0.8546409807355516
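Since class 0 makes up 476 of the 571 test samples in the MNB report, it may help to compare these accuracies against a trivial baseline that always predicts 0. The sketch below uses only the support counts from that report; the all-zeros prediction is a hypothetical baseline, not one of the models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Test labels rebuilt from the support column of the MNB report.
y_true = np.array([0] * 476 + [1] * 95)

# Hypothetical baseline: always predict the majority class (0).
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)  # 476 / 571, roughly 0.834
baseline_recall = recall_score(y_true, y_pred)      # recall on class 1

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall (class 1): {baseline_recall:.4f}")
```

The baseline accuracy, 476/571, appears to match the SVM accuracy reported above, which suggests that accuracy alone says little here and that recall and f1-score on class 1 are the more informative numbers.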
All the tutorials show that the steps for building a good model on text data are removing stopwords, punctuation and extra words.
I have done all of these things, but there is probably something more I could do to improve the results.
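For reference, the workflow described above could be sketched roughly as follows. This is not the exact code behind the numbers in this post: the toy corpus and its labels are hypothetical, and the use of `TfidfVectorizer`, `LinearSVC`, `max_iter=1000` and `random_state=0` are assumptions made only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, imbalanced on purpose (8 vs 4 samples).
texts = [
    "council approves new budget for city schools",
    "local team wins the regional championship",
    "weather service forecasts rain for the weekend",
    "university opens a new research library",
    "parliament votes on the transport bill",
    "hospital adds beds to the emergency ward",
    "museum announces a free entry day",
    "company reports quarterly earnings growth",
    "miracle pill cures every disease overnight",
    "secret aliens control the world banks",
    "celebrity clone spotted in hidden bunker",
    "drinking bleach shown to boost immunity",
]
labels = [0] * 8 + [1] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "LogReg": LogisticRegression(max_iter=1000),
}

reports = {}
for name, clf in models.items():
    # The vectorizer lowercases, drops punctuation via its default token
    # pattern, and removes English stopwords before computing TF-IDF.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    reports[name] = classification_report(y_test, preds, zero_division=0)
    print(name)
    print(reports[name])
```

Putting the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training split only. The scores on such a tiny corpus are meaningless; the sketch only shows the shape of the setup.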
I read that, in general, results above 99% usually point to problems such as overfitting; however, I would really have liked to reach at least 92%.
What do you think? How could I further improve the models? Do you think that having unbalanced classes could have affected the results?
Any suggestions would be greatly appreciated.