I have a dataframe with many rows, each containing a tweet. I would like to classify the tweets using a machine learning technique (supervised or unsupervised).
Since the dataset is unlabelled, I thought I would select a subset of rows (50%) to label manually (+1 positive, -1 negative, 0 neutral), then use machine learning to assign labels to the remaining rows.
In order to do this, I did as follows:

    Date        ID    Tweet
    01/20/2020  4141  The cat is on the table
    01/20/2020  4142  The sky is blue
    01/20/2020  53    What a wonderful day
    ...
    05/12/2020  532   In this extraordinary circumstance we are together
    05/13/2020  12    It was a very bad decision
    05/22/2020  565   I know you are the best

- Split the dataset into 50% train and 50% test. I manually labelled 50% of the data as follows:

    Date        ID    Tweet                                               PosNegNeu
    01/20/2020  4141  The cat is on the table                              0
    01/20/2020  4142  The weather is bad today                            -1
    01/20/2020  53    What a wonderful day                                 1
    ...
    05/12/2020  532   In this extraordinary circumstance we are together   1
    05/13/2020  12    It was a very bad decision                          -1
    05/22/2020  565   I know you are the best                              1

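For reference, a minimal sketch of how such a 50/50 split could be done with pandas. The dataframe name `df`, the variable names `to_label` and `unlabelled`, and the fixed seed are assumptions for illustration; the rows are a toy stand-in copied from the example data.

```python
import pandas as pd

# Toy stand-in for the real dataframe (rows copied from the example above).
df = pd.DataFrame({
    "Date": ["01/20/2020", "01/20/2020", "01/20/2020",
             "05/12/2020", "05/13/2020", "05/22/2020"],
    "ID": [4141, 4142, 53, 532, 12, 565],
    "Tweet": [
        "The cat is on the table",
        "The sky is blue",
        "What a wonderful day",
        "In this extraordinary circumstance we are together",
        "It was a very bad decision",
        "I know you are the best",
    ],
})

# Randomly pick half of the rows to label by hand (seed fixed for reproducibility).
to_label = df.sample(frac=0.5, random_state=42)

# Every row that was not sampled is left for the model to label.
unlabelled = df.drop(to_label.index)

print(len(to_label), len(unlabelled))
```

A random sample is used here rather than the first half of the rows, so that the hand-labelled part is not biased towards the earliest dates.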
Then I extracted word frequencies (after removing stopwords):

                  Frequency
    bad           2
    circumstance  1
    best          1
    day           1
    today         1
    wonderful     1

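A word count of this kind could be sketched with the standard library as follows. The stopword list is a tiny hand-written assumption, and `tokenize` is a hypothetical helper; a real stopword list would be longer, so the counts will not match the table above word for word.

```python
import re
from collections import Counter

# Labelled tweets from the example above.
tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]

# Assumption: a tiny hand-written stopword list, for illustration only.
STOPWORDS = {"the", "is", "on", "a", "what", "in", "this", "we", "are",
             "it", "was", "very", "i", "you", "know"}

def tokenize(text):
    """Lowercase the text and keep tokens (letters/apostrophes) that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Count every remaining word across all tweets.
freq = Counter(word for tweet in tweets for word in tokenize(tweet))

print(freq.most_common(1))  # [('bad', 2)]
```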
I would like to try to assign labels to the remaining data based on:
- words in the frequency table, e.g. with a rule such as "if a tweet contains bad, then assign -1; if a tweet contains wonderful, assign 1" (i.e. I should create a list of strings and a rule);
- sentence similarity (e.g. using Levenshtein distance).
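Both ideas can be sketched in plain Python. The names `KEYWORDS`, `label_by_keywords`, `levenshtein` and `label_by_nearest` are hypothetical, and the keyword scores are just the two rules mentioned above; this is a rough illustration, not a tuned method.

```python
import re

# Assumption: keyword -> label rules taken from the two examples above.
KEYWORDS = {"bad": -1, "wonderful": 1}

def label_by_keywords(tweet):
    """Sum the scores of known keywords and return the sign (-1, 0 or 1)."""
    words = re.findall(r"[a-z']+", tweet.lower())
    total = sum(KEYWORDS[w] for w in words if w in KEYWORDS)
    return (total > 0) - (total < 0)

def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def label_by_nearest(tweet, labelled):
    """Copy the label of the labelled tweet with the smallest edit distance."""
    _, label = min(labelled,
                   key=lambda pair: levenshtein(tweet.lower(), pair[0].lower()))
    return label

# Hand-labelled tweets from the example above.
labelled = [
    ("The cat is on the table", 0),
    ("The weather is bad today", -1),
    ("What a wonderful day", 1),
    ("In this extraordinary circumstance we are together", 1),
    ("It was a very bad decision", -1),
    ("I know you are the best", 1),
]

print(label_by_keywords("It is such a wonderful day"))               # 1
print(label_by_keywords("Laura's pen is black"))                     # 0
print(label_by_nearest("It was a very bad decision!", labelled))     # -1
```

Note that Levenshtein distance compares characters, not meaning, so two tweets with similar wording but opposite sentiment can end up as nearest neighbours.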
I know that there are several ways to do this, some probably better, but I am having trouble classifying/assigning labels to my data, and I cannot label all of it manually.
My expected output, e.g. with the following test dataset

    Date        ID    Tweet
    06/12/2020  43    My cat 'Sylvester' is on the table
    07/02/2020  75    Laura's pen is black
    07/02/2020  763   It is such a wonderful day
    ...
    11/06/2020  1415  No matter what you need to do
    05/15/2020  64    I disagree with you: I think it is a very bad decision
    12/27/2020  565   I know you can improve

should be something like

    Date        ID    Tweet                                                   PosNegNeu
    06/12/2020  43    My cat 'Sylvester' is on the table                       0
    07/02/2020  75    Laura's pen is black                                     0
    07/02/2020  763   It is such a wonderful day                               1
    ...
    11/06/2020  1415  No matter what you need to do                            0
    05/15/2020  64    I disagree with you: I think it is a very bad decision  -1
    12/27/2020  565   I know you can improve                                   0

A better approach would probably be to consider n-grams rather than single words, or to build a corpus/vocabulary to assign a score and then a sentiment. Any advice would be greatly appreciated, as this is my first machine learning exercise. I think that k-means clustering could also be applied to group similar sentences.
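To make the n-gram idea concrete, here is a minimal supervised sketch, assuming scikit-learn is installed. It combines TF-IDF features over unigrams and bigrams with logistic regression; that choice of classifier is an assumption for illustration, not the only option. With only six training tweets the predictions are not meaningful, so no expected output is shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled training tweets from the example above.
train_tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]
train_labels = [0, -1, 1, 1, -1, 1]

# Unlabelled tweets from the test dataset above.
test_tweets = [
    "My cat 'Sylvester' is on the table",
    "Laura's pen is black",
    "It is such a wonderful day",
    "No matter what you need to do",
    "I disagree with you: I think it is a very bad decision",
    "I know you can improve",
]

# TF-IDF over single words and word pairs, with English stopwords removed,
# followed by a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_tweets, train_labels)

# Predict one label (-1, 0 or 1) per unlabelled tweet.
predictions = model.predict(test_tweets)
for tweet, label in zip(test_tweets, predictions):
    print(label, tweet)
```

On a real dataset, part of the hand-labelled tweets would normally be held back to measure accuracy before trusting the predicted labels.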
If you could provide a complete example (using my data would be great, but other data would be fine as well), I would really appreciate it.