#StackBounty: #neural-network #regression #decision-trees #bert #embeddings Combining heterogeneous numerical and text features

Bounty: 50

We want to solve a regression problem of the form "given two objects $x$ and $y$, predict their score (think of it as a similarity) $w(x,y)$". We have two types of features:

  • For each object, we have about 1000 numerical features, mainly of the following types: 1) historical score information, e.g. the historical mean $w(x,\cdot)$ up to the point where the feature is used; 2) 0/1 indicator features encoding whether object $x$ has a particular attribute; etc.
  • For each object, we have a text which describes the object (description is not reliable, but still useful).

Clearly, when predicting a score for a pair $(x,y)$, we can use features for both $x$ and $y$.

We are currently using the following setup (I omit validation/testing):

  • For the texts, we compute their BERT embeddings and then derive a feature from the similarity between the two embedding vectors (e.g. the cosine similarity between them); a minimal sketch of this step and of the fine-tuning is given after this list.
  • We split the dataset into a fine-tuning dataset and a training dataset (the fine-tuning dataset may be empty, meaning no fine-tuning is performed).
  • Using the fine-tuning dataset, we fine-tune BERT embeddings.
  • Using the training dataset, we train decision trees to predict the scores.
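
For concreteness, here is a minimal sketch of the text pipeline (mean pooling over BERT's last hidden states, cosine similarity as the feature, and an MSE objective against the observed scores for fine-tuning). The model name, pooling choice, loss, and the finetune_loader are illustrative, not a definitive description of our code:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        # Tokenize, run BERT, and mean-pool the last hidden states over
        # non-padding tokens: one 768-dimensional vector per text.
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        out = encoder(**batch).last_hidden_state            # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)        # (B, T, 1)
        return (out * mask).sum(dim=1) / mask.sum(dim=1)    # (B, 768)

    def bert_similarity_feature(text_x, text_y):
        # The single numerical feature handed to the decision trees.
        with torch.no_grad():
            e_x, e_y = embed([text_x]), embed([text_y])
        return F.cosine_similarity(e_x, e_y).item()

    # Fine-tuning on the fine-tuning split: regress the cosine similarity
    # onto the (suitably rescaled) scores w(x, y).
    optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
    for texts_x, texts_y, scores in finetune_loader:  # finetune_loader is assumed
        sim = F.cosine_similarity(embed(texts_x), embed(texts_y))
        loss = F.mse_loss(sim, scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()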

We compare the following approaches:

  • Without BERT features.
  • Using BERT features, but without fine-tuning. This gives a reasonable improvement in prediction accuracy.
  • Using BERT features, with fine-tuning. The additional improvement is very small (although predictions made from the BERT features alone did improve, of course).

Question: Is there something simple I'm missing in this approach? For example, are there better ways to use the texts, other ways to use the embeddings, or better models than decision trees?

I have tried several things, without any success. The approaches that I expected to provide improvements are the following:

  • Fine-tune the embeddings to predict the difference between $w(x,y)$ and the mean $w(x,\cdot)$. The motivation is that we already have the feature "mean $w(x,\cdot)$", which serves as a baseline for object $x$, so we are really interested in the deviation from this mean (a sketch of this target construction is given at the end of the question).

  • Use a neural network instead of decision trees. Namely, I use a few dense layers to turn the embedding vectors into features, like this:

      import torch.nn as nn

      # Embedding head: maps the concatenated BERT embeddings of x and y
      # (768 dimensions each) to 10 learned features.
      embedding_head = nn.Sequential(
          nn.Linear(768 * 2, 1000),
          nn.BatchNorm1d(1000),
          nn.ReLU(),
          nn.Linear(1000, 500),
          nn.BatchNorm1d(500),
          nn.ReLU(),
          nn.Linear(500, 100),
          nn.BatchNorm1d(100),
          nn.ReLU(),
          nn.Linear(100, 10),
          nn.BatchNorm1d(10),
          nn.ReLU(),
      )
    

    After that, I concatenate these new $10$ features with the $2000$ numerical features I already have and apply a similar architecture on top of them (a sketch of how the two blocks are wired together follows this code):

      # Final regressor: takes the 10 embedding-derived features together with
      # the numerical features of the pair and predicts the score w(x, y).
      regressor = nn.Sequential(
          nn.Linear(10 + n_features, 1000),
          nn.BatchNorm1d(1000),
          nn.ReLU(),
          nn.Linear(1000, 500),
          nn.BatchNorm1d(500),
          nn.ReLU(),
          nn.Linear(500, 100),
          nn.BatchNorm1d(100),
          nn.ReLU(),
          nn.Linear(100, 1),
      )
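
    To be explicit about how the two blocks are connected, here is a sketch of the full module (embedding_head and regressor denote the two nn.Sequential blocks above; the module and the names are illustrative):

      import torch
      import torch.nn as nn

      class PairScoreModel(nn.Module):
          def __init__(self, embedding_head, regressor):
              super().__init__()
              self.embedding_head = embedding_head
              self.regressor = regressor

          def forward(self, emb_x, emb_y, numeric_features):
              # emb_x, emb_y: (B, 768) BERT embeddings of the two texts;
              # numeric_features: (B, n_features) numerical features of the pair.
              text_features = self.embedding_head(torch.cat([emb_x, emb_y], dim=1))
              combined = torch.cat([text_features, numeric_features], dim=1)
              return self.regressor(combined).squeeze(-1)  # predicted w(x, y)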
    

But as a result, my predictions are much worse than those of the decision trees. Are there better architectures suited to my case?
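
For completeness, here is roughly how the residual target from the first bullet above is constructed (a sketch assuming the pairs are stored in a pandas DataFrame with columns object_x, object_y and score; the column names are illustrative):

    import pandas as pd

    def add_residual_target(df: pd.DataFrame) -> pd.DataFrame:
        # Historical mean w(x, .) per object x. In practice this mean is computed
        # only from pairs observed before the current one, to avoid leakage;
        # the plain groupby here is just to illustrate the target.
        mean_per_x = df.groupby("object_x")["score"].transform("mean")
        df = df.copy()
        df["score_residual"] = df["score"] - mean_per_x
        return df

The embeddings are then fine-tuned against score_residual instead of the raw score.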

