#StackBounty: #python #logistic-regression #supervised-learning #text-classification How to include categorical fields to enhance a tex…

Bounty: 50

I have a question about how to add more categorical fields to a classification problem.
My dataset initially had 4 fields:

    Date         Text                                            Short_Mex                                Username   Label
    01/01/2020   I am waiting for the TRAIN                      A train is coming                        Ludo       1
    01/01/2020   you need to keep distance                       Social Distance is mandatory             wgriws     0
    ...
    02/01/2020   trump declared war against CHINESE technology   China’s technology is out of the games   Fwu32      1

I joined this dataset with another one that contains the labels, which take the values 1 or 0. The labels are needed for the classification.

However, I have also extracted other fields from my original dataset, such as the number of characters, the upper-case words, the most frequent terms, and so on.
Some of these fields may be useful for classification, since I could assign more ‘weight’ to a word written in upper case than to one in lower case.
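
As a rough illustration of the extracted fields mentioned above, here is a minimal sketch that derives the upper-case words, their count, and the character count from the sample texts. The helper name `upper_words` and the variables `upper`, `n_upper`, and `n_chars` are hypothetical, and splitting on whitespace is a simplifying assumption (punctuation attached to a word is not stripped).

```python
# Sample texts copied from the table in the question.
texts = [
    "I am waiting for the TRAIN",
    "you need to keep distance",
    "trump declared war against CHINESE technology",
]


def upper_words(text):
    """Return the fully upper-case words of a text.

    One-letter words such as 'I' are skipped (an assumption, chosen so that
    the result matches the Upper column shown in the question).
    """
    return [w for w in text.split() if w.isupper() and len(w) > 1]


# One list of upper-case words per text, plus two simple numeric features.
upper = [upper_words(t) for t in texts]
n_upper = [len(u) for u in upper]
n_chars = [len(t) for t in texts]
```

With a pandas DataFrame, the same helper could presumably be applied as `df['Upper'] = df['Text'].apply(upper_words)`.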

So I would need to use a new dataset with these fields:

    Date         Text                                            Short_Mex                                Username   Upper       Label
    01/01/2020   I am waiting for the TRAIN                      A train is coming                        Ludo       [TRAIN]     1
    01/01/2020   you need to keep distance                       Social Distance is mandatory             wgriws     []          0
    ...
    02/01/2020   trump declared war against CHINESE technology   China’s technology is out of the games   Fwu32      [CHINESE]   1
...

I would like to ask how to add this information (the upper-case words) as a new feature for my classifier.
This is what I am currently doing:

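    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # Train-test split; the target is assumed to be the Label column of df
    x_train, x_test, y_train, y_test = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=1)

    # Logistic regression classification on TF-IDF weighted word counts
    pipe1 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LogisticRegression())])

    model_lr = pipe1.fit(x_train, y_train)
    lr_pred = model_lr.predict(x_test)

    # Evaluation on the held-out test set
    print("Accuracy of Logistic Regression Classifier: {}%".format(round(accuracy_score(y_test, lr_pred)*100, 2)))
    print("\nConfusion Matrix of Logistic Regression Classifier:\n")
    print(confusion_matrix(y_test, lr_pred))
    print("\nClassification Report of Logistic Regression Classifier:\n")
    print(classification_report(y_test, lr_pred))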
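
One possible way to feed extra fields to the classifier alongside the text is scikit-learn's `ColumnTransformer`, which applies a different transformer to each column and concatenates the results. The sketch below is only an illustration under stated assumptions: the toy rows are taken from the sample table, the numeric columns `n_upper` and `n_chars` are hypothetical features derived from `Text`, and `pipe2` is a made-up name. `TfidfVectorizer` is used as a stand-in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows from the question's sample table (far too few for a real model).
df = pd.DataFrame({
    "Text": [
        "I am waiting for the TRAIN",
        "you need to keep distance",
        "trump declared war against CHINESE technology",
    ],
    "Label": [1, 0, 1],
})

# Hypothetical numeric features: count of upper-case words (ignoring
# one-letter words such as 'I') and number of characters.
df["n_upper"] = df["Text"].apply(
    lambda t: sum(w.isupper() and len(w) > 1 for w in t.split())
)
df["n_chars"] = df["Text"].str.len()

feature_cols = ["Text", "n_upper", "n_chars"]

features = ColumnTransformer([
    # A single column name hands the vectorizer a 1-D column of strings.
    ("text", TfidfVectorizer(), "Text"),
    # A list of names hands the scaler a 2-D block of numeric columns.
    ("num", StandardScaler(), ["n_upper", "n_chars"]),
])

pipe2 = Pipeline([("features", features), ("model", LogisticRegression())])

# Fit and predict on the same toy rows, only to show that the pieces connect.
pipe2.fit(df[feature_cols], df["Label"])
pred = pipe2.predict(df[feature_cols])
```

In a real run the DataFrame with all feature columns would go through `train_test_split` instead of `df['Text']` alone. The `Upper` lists themselves might also be encoded as a bag of tokens, for instance with a `CountVectorizer` given a callable `analyzer`, though whether that helps would need to be checked on the actual data.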

