# #StackBounty: #neural-network #regression #decision-trees #bert #embeddings Combining heterogeneous numerical and text features

### Bounty: 50

We want to solve a regression problem of the form "given two objects $$x$$ and $$y$$, predict their score (think of it as a similarity) $$w(x,y)$$". We have two types of features:

• For each object, we have about 1000 numerical features, mainly of the following types: 1) "historical score info", e.g. the historical mean of $$w(x,\cdot)$$ up to the point where the feature is used; 2) 0/1 indicator features for whether object $$x$$ has a particular attribute; etc.
• For each object, we have a text describing it (the description is not reliable, but it is still useful).

Clearly, when predicting a score for a pair $$(x,y)$$, we can use features for both $$x$$ and $$y$$.

We are currently using the following setup (I omit validation/testing):

• For the texts, we compute their BERT embeddings and then produce a feature based on the similarity between the embedding vectors (e.g. the cosine similarity between them).
• We split the dataset into a fine-tuning dataset and a training dataset. The fine-tuning dataset may be empty, meaning no fine-tuning.
• Using the fine-tuning dataset, we fine-tune the BERT embeddings.
• Using the training dataset, we train decision trees to predict the scores. (A sketch of this pipeline follows the list.)
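
A minimal sketch of the pipeline (the sentence-transformers/scikit-learn stack, the checkpoint name, and the toy data below are stand-ins for our actual setup):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-ins for the real data: two descriptions per pair, the
# ~2000 numerical features, and the target scores.
texts_x = ["first object, variant A", "second object"]
texts_y = ["first object, variant A'", "third object"]
X_num = np.random.rand(2, 4)   # stand-in for the ~2000 numerical features
w = np.array([0.9, 0.1])       # stand-in target scores

# Embed both descriptions; with normalized embeddings the dot product
# equals the cosine similarity.
model = SentenceTransformer("bert-base-uncased")
emb_x = model.encode(texts_x, normalize_embeddings=True)
emb_y = model.encode(texts_y, normalize_embeddings=True)
cos_sim = np.sum(emb_x * emb_y, axis=1)

# Append the similarity as one extra feature and fit the tree model.
X = np.column_stack([X_num, cos_sim])
trees = GradientBoostingRegressor().fit(X, w)
```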

We compare the following approaches:

• Without BERT features.
• Using BERT features, but without fine-tuning. There is some reasonable improvement in prediction accuracy.
• Using BERT features, with fine-tuning. The improvement is very small (but the prediction using only BERT features improved, of course).

Question: Is there something simple I'm missing in this approach? E.g. are there better ways to use the texts? Other ways to use the embeddings? Better models than decision trees?

I tried multiple things, without any success. The approaches I expected to provide improvements are the following:

• Fine-tune the embeddings to predict the difference between $$w(x,y)$$ and the mean $$w(x,\cdot)$$. The motivation is that we already have a "mean $$w(x,\cdot)$$" feature, which serves as a baseline for object $$x$$, and we are interested in the deviation from this mean.
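
A minimal sketch of this idea, assuming a sentence-transformers-style setup (the column names, the checkpoint, and the max-abs scaling of the residual are my stand-ins; `CosineSimilarityLoss` regresses the cosine of the two embeddings onto a label roughly in $$[-1, 1]$$):

```python
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical frame with one row per pair: the score, the historical
# mean feature we already have, and the two descriptions.
df = pd.DataFrame({
    "w":        [0.9, 0.1],
    "mean_w_x": [0.6, 0.3],
    "text_x":   ["first object", "second object"],
    "text_y":   ["third object", "fourth object"],
})
df["residual"] = df["w"] - df["mean_w_x"]
# Squash the residual into [-1, 1] so it can serve as a cosine target
# (max-abs scaling, just as an example).
df["label"] = df["residual"] / df["residual"].abs().max()

examples = [
    InputExample(texts=[r.text_x, r.text_y], label=float(r.label))
    for r in df.itertuples()
]
model = SentenceTransformer("bert-base-uncased")
loader = DataLoader(examples, shuffle=True, batch_size=2)
model.fit(
    train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
    epochs=1,
    warmup_steps=10,
)
```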

• Use a neural network instead of decision trees. Namely, I use a few dense layers to turn the embedding vectors into features, like this:

```python
import torch.nn as nn

# Compress the concatenated pair of 768-dim BERT embeddings into 10 features.
embedding_net = nn.Sequential(
    nn.Linear(768 * 2, 1000),
    nn.BatchNorm1d(1000),
    nn.ReLU(),
    nn.Linear(1000, 500),
    nn.BatchNorm1d(500),
    nn.ReLU(),
    nn.Linear(500, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.BatchNorm1d(10),
    nn.ReLU(),
)
```

After that, I combine these new $$10$$ features with the $$2000$$ features I already have, and use a similar architecture on top of them:

```python
# n_features = 2000: the numerical features for the pair (x, y).
head = nn.Sequential(
    nn.Linear(10 + n_features, 1000),
    nn.BatchNorm1d(1000),
    nn.ReLU(),
    nn.Linear(1000, 500),
    nn.BatchNorm1d(500),
    nn.ReLU(),
    nn.Linear(500, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
    nn.Linear(100, 1),  # single regression output: the predicted score
)
```
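
Concretely, the two blocks are combined by concatenating the $$10$$ learned text features with the numerical features; roughly like this (the class and variable names are simplified, and the MSE loss is just an example):

```python
import torch
import torch.nn as nn

class CombinedNet(nn.Module):
    """The two Sequential blocks above, wired together by concatenation."""

    def __init__(self, embedding_net: nn.Module, head: nn.Module):
        super().__init__()
        self.embedding_net = embedding_net  # 768*2 -> 10 learned text features
        self.head = head                    # (10 + n_features) -> 1 score

    def forward(self, emb_x, emb_y, num_feats):
        # Concatenate the two 768-dim BERT embeddings, compress them to 10
        # learned features, then append the precomputed numerical features.
        emb_feats = self.embedding_net(torch.cat([emb_x, emb_y], dim=1))
        return self.head(torch.cat([emb_feats, num_feats], dim=1)).squeeze(1)

net = CombinedNet(embedding_net, head)  # trained end-to-end, e.g. with MSE loss
```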

But as a result, my predictions are much worse than those from the decision trees. Are there architectures better suited to my case?
