#StackBounty: #neural-network #regression #decision-trees #bert #embeddings Combining heterogeneous numerical and text features

Bounty: 50

We want to solve a regression problem of the form "given two objects $x$ and $y$, predict their score (think of it as a similarity) $w(x,y)$". We have two types of features:

  • For each object, we have about 1000 numerical features, mainly of the following types: 1) "historical score info", e.g. the historical mean $w(x,\cdot)$ up to the point at which we use the feature; 2) 0/1 features indicating whether object $x$ has a particular attribute; etc.
  • For each object, we have a text describing it (the description is not reliable, but still useful).

Clearly, when predicting a score for a pair $(x,y)$, we can use features for both $x$ and $y$.

We are currently using the following setup (I omit validation/testing):

  • For texts, we compute their BERT embeddings and then produce a feature based on the similarity between the embedding vectors (e.g. the cosine similarity between them); a sketch follows this list.
  • We split the dataset into fine-tuning and training datasets. The fine-tuning dataset may be empty, meaning no fine-tuning.
  • Using the fine-tuning dataset, we fine-tune BERT embeddings.
  • Using the training dataset, we train decision trees to predict the scores.
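
For concreteness, here is a minimal sketch of the cosine-similarity feature. The checkpoint (bert-base-uncased), the mean pooling, and the placeholder strings `text_x`/`text_y` are my assumptions; the post does not specify them:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    def embed(texts):
        # Mean-pool the last hidden states over the non-padding tokens.
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, 768)

    text_x, text_y = "description of x", "description of y"    # placeholders
    bert_feature = torch.nn.functional.cosine_similarity(
        embed([text_x]), embed([text_y])
    ).item()  # a single extra numerical feature for the downstream model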

We compare the following approaches:

  • Without BERT features.
  • Using BERT features, but without fine-tuning. There is some reasonable improvement in prediction accuracy.
  • Using BERT features, with fine-tuning. The improvement over the non-fine-tuned variant is very small (although the prediction using only BERT features did improve, of course).

Question: Is there something simple I’m missing in this approach? E.g. are there better ways to use the texts? Other ways to use the embeddings? Better approaches than decision trees?

I tried multiple things, without any success. The approaches I expected to provide improvements are the following:

  • Fine-tune embeddings to predict the difference between $w(x,y)$ and the mean $w(x,\cdot)$. The motivation is that we already have the feature "mean $w(x,\cdot)$", which is a baseline for object $x$, and we are interested in the deviation from this mean.

  • Use a NN instead of decision trees. Namely, I use a few dense layers to turn the embedding vectors into features, like this:

      import torch
      import torch.nn as nn

      # Maps the concatenated pair of 768-dim BERT embeddings to 10 features.
      embedding_head = nn.Sequential(
          nn.Linear(768 * 2, 1000),
          nn.BatchNorm1d(1000),
          nn.ReLU(),
          nn.Linear(1000, 500),
          nn.BatchNorm1d(500),
          nn.ReLU(),
          nn.Linear(500, 100),
          nn.BatchNorm1d(100),
          nn.ReLU(),
          nn.Linear(100, 10),
          nn.BatchNorm1d(10),
          nn.ReLU(),
      )
    

    After that, I combine these new $10$ features with the $2000$ features I already have, and use a similar architecture on top of them (a sketch of how the two blocks are wired together follows the second block):

      # Regression head: consumes the 10 learned text features together with
      # the n_features existing numerical features and outputs the score.
      regression_head = nn.Sequential(
          nn.Linear(10 + n_features, 1000),
          nn.BatchNorm1d(1000),
          nn.ReLU(),
          nn.Linear(1000, 500),
          nn.BatchNorm1d(500),
          nn.ReLU(),
          nn.Linear(500, 100),
          nn.BatchNorm1d(100),
          nn.ReLU(),
          nn.Linear(100, 1),
      )
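
    Concretely, the two blocks are wired together roughly like this (using the `embedding_head`/`regression_head` names introduced in the snippets above; batching and training-loop details omitted):

      def forward(emb_x, emb_y, tabular):
          # emb_x, emb_y: (B, 768) BERT embeddings of the two objects;
          # tabular: (B, n_features) numerical features for the pair (x, y).
          text_feats = embedding_head(torch.cat([emb_x, emb_y], dim=1))     # (B, 10)
          return regression_head(torch.cat([text_feats, tabular], dim=1))   # (B, 1)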
    

But as a result, my predictions are much worse than with decision trees. Are there better architectures suited to my case?



#StackBounty: #random-forest #decision-trees #machine-learning-model What is the best way to train a model?

Bounty: 50

I am trying to train my model for sports predictions.

The data frame looks like the example below:

     datetime             country    league                        home_team            away_team              home_odds    draw_odds    away_odds    home_score    away_score
---  -------------------  ---------  ----------------------------  -------------------  -------------------  -----------  -----------  -----------  ------------  ------------
  0  2020-02-22 14:00:00  Albania    First Division                Dinamo Tirana        Beselidhja Lezha            4.66         3.74         1.59             2             0
  1  2020-02-16 14:00:00  Albania    First Division                Beselidhja Lezha     Burreli                     1.82         3            4.42             2             1
  2  2020-02-08 14:00:00  Albania    First Division                Terbuni              Koplik                      1.41         4.2          5.85             2             1
  3  2020-01-26 13:00:00  Albania    First Division                Dinamo Tirana        Egnatia Rrogozhine          2.51         2.98         2.64             0             0
  4  2020-01-25 13:00:00  Albania    First Division                Elbasani             Oriku                       2.36         3.2          2.66             2             0

What would be the best way to train the model for predictions?

The training data is a database of all the soccer competitions and teams.

  • Should I train the model only on the competitions that appear in the testing data (filter out all of the rest and keep only the competitions that the team has played in before or is playing in) and then predict?

or

  • Keep the training data as is and predict?

I ask because a team has data outside a single competition as well.

Example:

Chelsea has played in the FA Cup, Champions League, Premier League and other competitions. I want to predict a Chelsea match in the Champions League. Should I take training data for Chelsea from all competitions, or should I filter the training data for Chelsea to just the Champions League? (Both options are sketched below.)
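
In code, the two options amount to something like the sketch below, assuming `df` is the data frame shown above (the team and competition names are just the example values):

    # Option 1: keep all of the team's matches, across every competition.
    team_rows = df[(df["home_team"] == "Chelsea") | (df["away_team"] == "Chelsea")]

    # Option 2: additionally restrict to the competition being predicted.
    cl_rows = team_rows[team_rows["league"] == "Champions League"]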

What could be defined as ‘noise’ in such a model?

What is the most useful approach, data-science-wise?



#StackBounty: #machine-learning #python #decision-trees #text-mining #unsupervised-learning Better approach to assign values to determi…

Bounty: 100

I am trying to assign a different value to each sentence based on information about the presence of hashtags, upper-case letters/words (e.g. HATE), and some other signals.

I created a data frame which includes some binary values (1 or 0):

Sentence              Upper case   Hashtags
I HATE migrants            1           0
I like cooking             0           0
#trump said he is ok       0           1
#blacklives SUPPORT        1           1

I would like to assign a value to each sentence based on whether the binary values above are satisfied, for example:

- if Upper case = 1 and Hashtags = 1 then assign -10;
- if Upper case = 1 and Hashtags = 0 then assign -5;
- if Upper case = 0 and Hashtags = 1 then assign -5;
- if Upper case = 0 and Hashtags = 0 then assign 0;

This would be OK for a small number of rules and combinations, but already with three variables to check there would be many more combinations to consider manually!
Do you know if there is a way to take all of these into account in an easy (and feasible) way? (See the sketch below.)
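
One observation, sketched below: the four rules above are exactly a weighted sum with a weight of -5 per active flag, and a weighted sum scales to any number of binary columns without enumerating their combinations. The weights here are read off the rules; learning them from labelled data is what the regression suggestion below would do:

    import pandas as pd

    df = pd.DataFrame({
        "Sentence": ["I HATE migrants", "I like cooking",
                     "#trump said he is ok", "#blacklives SUPPORT"],
        "Upper case": [1, 0, 0, 1],
        "Hashtags": [0, 0, 1, 1],
    })

    weights = {"Upper case": -5, "Hashtags": -5}  # add further binary columns here
    df["value"] = sum(w * df[col] for col, w in weights.items())
    # -> -5, 0, -5, -10: identical to the four hand-written rules above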

Someone told me about using regression, but I have never used it before for a similar task. The context is fake tweets.



#StackBounty: #scikit-learn #decision-trees #accuracy Evaluating Model Accuracy on a testing data set for a DecisionTreeRegressor Model

Bounty: 50

I am trying an exercise where I have been asked to “Evaluate each model accuracy on testing data set for a max_depth parameter value changing from 2 to 5”.

The model here is DecisionTreeRegressor. I just wanted to know which metric is used to calculate the accuracy of a DecisionTreeRegressor model.

My understanding is that it’s the same as the score, which can be calculated simply as regressor.score(X_test, Y_test).

Please let me know what should be used to calculate the accuracy of a DecisionTreeRegressor model.
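
For reference, score(X_test, Y_test) on an sklearn regressor returns the coefficient of determination $R^2$, not a classification accuracy. A minimal sketch of the requested max_depth sweep, assuming X_train, Y_train, X_test, Y_test already exist:

    from sklearn.tree import DecisionTreeRegressor

    for depth in range(2, 6):  # max_depth from 2 to 5
        regressor = DecisionTreeRegressor(max_depth=depth, random_state=0)
        regressor.fit(X_train, Y_train)
        print(depth, regressor.score(X_test, Y_test))  # R^2 on the test set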



#StackBounty: #machine-learning #scikit-learn #decision-trees is it possible to output more than 2 nodes away from a node in a decision…

Bounty: 50

Usually a decision tree has one root node, some internal nodes, and some leaves.

Lots of tutorials illustrate this as something like a binary tree.

Is it possible to have more than 2 child nodes coming out of a node in a decision tree?

[Image from the linked post: a decision tree whose root node splits three ways, into Low, Med and High.]

By “more than 2 nodes”, I mean there are more than 2 branches coming out of the root node (in this case 3: Low, Med, High).

If this is reasonable in real-life applications, please provide an open dataset on which a decision tree would split into more than 2 nodes, and a piece of sklearn code.
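
For context, scikit-learn's decision trees implement CART and only ever make binary splits, so a three-way Low/Med/High split is represented as two stacked binary splits. A toy sketch on synthetic data (not an open dataset) that makes this visible:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(300, 1))  # ordinal feature: 0=Low, 1=Med, 2=High
    y = X.ravel()                          # the class is exactly the category
    tree = DecisionTreeClassifier().fit(X, y)
    print(export_text(tree, feature_names=["level"]))
    # Shows two binary splits (level <= 0.5, level <= 1.5), not one 3-way split.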

