#StackBounty: #python #pandas #machine-learning #sentiment-analysis How to assign labels/score to data using machine learning

Bounty: 50

I have a dataframe with many rows, each containing a tweet. I would like to classify them using a machine learning technique (supervised or unsupervised).
Since the dataset is unlabelled, I thought I would select a subset of rows (50%) to label manually (+1 positive, -1 negative, 0 neutral), then use machine learning to assign labels to the remaining rows.
To do this, I proceeded as follows:

Original Dataset

Date                   ID        Tweet                         
01/20/2020           4141    The cat is on the table               
01/20/2020           4142    The sky is blue                       
01/20/2020           53      What a wonderful day                  
...
05/12/2020           532     In this extraordinary circumstance we are together   
05/13/2020           12      It was a very bad decision            
05/22/2020           565     I know you are the best              

  1. I split the dataset into 50% train and 50% test, and manually labelled the training half as follows:
    Date                   ID        Tweet                          PosNegNeu
     01/20/2020           4141    The cat is on the table               0
     01/20/2020           4142    The weather is bad today              -1
     01/20/2020           53      What a wonderful day                  1
     ...
     05/12/2020           532     In this extraordinary circumstance we are together   1
     05/13/2020           12      It was a very bad decision            -1
     05/22/2020           565     I know you are the best               1
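
For reference, the split step could be sketched roughly as below. This is only an illustration under assumptions: it uses just the six rows visible in the table above (the elided rows are not reproduced), and the names `to_label` and `unlabelled`, as well as the `random_state` value, are made up for the example.

```python
import pandas as pd

# Only the rows visible in the question; the real dataframe has many more.
df = pd.DataFrame({
    "Date": ["01/20/2020", "01/20/2020", "01/20/2020",
             "05/12/2020", "05/13/2020", "05/22/2020"],
    "ID": [4141, 4142, 53, 532, 12, 565],
    "Tweet": [
        "The cat is on the table",
        "The sky is blue",
        "What a wonderful day",
        "In this extraordinary circumstance we are together",
        "It was a very bad decision",
        "I know you are the best",
    ],
})

# Randomly pick half of the rows for manual labelling (seed is arbitrary).
to_label = df.sample(frac=0.5, random_state=42)

# Everything that was not picked stays unlabelled for now.
unlabelled = df.drop(to_label.index)

# Add an empty label column to be filled in by hand (+1, -1, 0).
to_label = to_label.assign(PosNegNeu=pd.NA)

print(to_label)
print(unlabelled)
```

A random sample is used here instead of taking the first half, on the assumption that the tweets are ordered by date and the first half might not be representative.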
    

Then I extracted word frequencies (after removing stopwords):

               Frequency
 bad               2
 circumstance      1
 best              1
 day               1
 today             1
 wonderful         1

….
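
A frequency table like the one above could be produced along these lines. This is a minimal sketch using only the standard library; the `STOPWORDS` set is a tiny hand-written assumption (a real run would use a proper stopword list), and only the six labelled tweets shown earlier are included.

```python
import re
from collections import Counter

# Assumed, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "is", "on", "a", "what", "in", "this", "we", "are",
             "it", "was", "very", "i", "you", "know"}

labelled_tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Count every remaining word across all labelled tweets.
word_counts = Counter(token for tweet in labelled_tweets for token in tokenize(tweet))

for word, count in word_counts.most_common():
    print(f"{word:15}{count}")
```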

I would like to try to assign labels to the other data based on:

  • words in the frequency table, for example a rule such as "if a tweet contains bad, then assign -1; if a tweet contains wonderful, assign 1" (i.e. I would create a list of strings and a rule);
  • sentence similarity (e.g. using Levenshtein distance).
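
To make the two ideas concrete, here is a rough sketch of both. Everything in it is an assumption for illustration: the keyword sets `POSITIVE_WORDS` and `NEGATIVE_WORDS` are hand-picked from the frequency table, the function names are made up, and the Levenshtein distance is written out in plain Python rather than taken from a library.

```python
import re

# Hypothetical keyword lists, picked by hand from the frequency table.
POSITIVE_WORDS = {"wonderful", "best"}
NEGATIVE_WORDS = {"bad"}

# The manually labelled tweets shown earlier in the question.
labelled = [
    ("The cat is on the table", 0),
    ("The weather is bad today", -1),
    ("What a wonderful day", 1),
    ("In this extraordinary circumstance we are together", 1),
    ("It was a very bad decision", -1),
    ("I know you are the best", 1),
]

def label_by_keywords(tweet):
    """Return 1, -1 or 0 depending on which keyword list matches more words."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return (score > 0) - (score < 0)  # sign of the score

def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def label_by_similarity(tweet):
    """Copy the label of the labelled tweet with the smallest edit distance."""
    nearest = min(labelled, key=lambda pair: levenshtein(tweet.lower(), pair[0].lower()))
    return nearest[1]

for tweet in ["It is such a wonderful day", "Laura's pen is black"]:
    print(tweet, label_by_keywords(tweet), label_by_similarity(tweet))
```

Note that the keyword rule ignores negation ("not bad" would still score -1), and character-level edit distance measures spelling overlap rather than meaning, so both are probably best treated as baselines.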

I know that there are several ways to do this, some probably better, but I am having trouble classifying/assigning labels to my data, and I cannot label everything manually.

My expected output, e.g. with the following test dataset

Date                   ID        Tweet                                   
06/12/2020           43       My cat 'Sylvester' is on the table            
07/02/2020           75       Laura's pen is black                                                
07/02/2020           763      It is such a wonderful day                                    
...
11/06/2020           1415    No matter what you need to do                  
05/15/2020           64      I disagree with you: I think it is a very bad decision           
12/27/2020           565     I know you can improve                         

should be something like

Date                   ID        Tweet                                   PosNegNeu
06/12/2020           43       My cat 'Sylvester' is on the table            0
07/02/2020           75       Laura's pen is black                          0                       
07/02/2020           763      It is such a wonderful day                    1                
...
11/06/2020           1415    No matter what you need to do                  0  
05/15/2020           64      I disagree with you: I think it is a very bad decision  -1          
12/27/2020           565     I know you can improve                         0   

A better approach would probably be to consider n-grams rather than single words, or to build a corpus/vocabulary to assign a score and then a sentiment. Any advice would be greatly appreciated, as this is my first machine learning exercise. I think k-means clustering could also be applied to group similar sentences.
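
The k-means idea might look something like the sketch below, assuming scikit-learn is available. The choice of three clusters, the TF-IDF features, and the `random_state` are all assumptions made for the example; nothing guarantees that the resulting clusters correspond to positive/negative/neutral.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]

# Turn each tweet into a TF-IDF vector, dropping English stopwords.
vectors = TfidfVectorizer(stop_words="english").fit_transform(tweets)

# Group the vectors into three clusters (three is an arbitrary choice here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)

for cluster_id, tweet in zip(cluster_ids, tweets):
    print(cluster_id, tweet)
```

Since k-means is unsupervised, the cluster numbers are arbitrary ids, so each cluster would still need to be inspected by hand before mapping it to a sentiment.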
If you could provide a complete example (using my data would be great, but other data would be fine as well), I would really appreciate it.
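
As a starting point for the supervised route, something like the following might work, assuming pandas and scikit-learn. It is only a sketch: the model choice (TF-IDF over unigrams and bigrams fed into logistic regression) is one common baseline among many, and with just the six labelled tweets shown above the predictions should not be expected to match the desired output.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually labelled half (only the rows shown in the question).
train = pd.DataFrame({
    "Tweet": [
        "The cat is on the table",
        "The weather is bad today",
        "What a wonderful day",
        "In this extraordinary circumstance we are together",
        "It was a very bad decision",
        "I know you are the best",
    ],
    "PosNegNeu": [0, -1, 1, 1, -1, 1],
})

# Unlabelled half that should receive predicted labels.
test = pd.DataFrame({
    "Tweet": [
        "My cat 'Sylvester' is on the table",
        "Laura's pen is black",
        "It is such a wonderful day",
        "No matter what you need to do",
        "I disagree with you: I think it is a very bad decision",
        "I know you can improve",
    ],
})

# ngram_range=(1, 2) uses single words and word pairs as features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train["Tweet"], train["PosNegNeu"])

# Write the predictions into the same column name used for the manual labels.
test["PosNegNeu"] = model.predict(test["Tweet"])
print(test)
```

With a realistically sized labelled set, holding out part of it (for example with `sklearn.model_selection.train_test_split`) would give a rough accuracy estimate before trusting the predicted labels.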

