We want to solve a regression problem of the form "given two objects $x$ and $y$, predict their score (think of it as a similarity) $w(x,y)$". We have 2 types of features:
- For each object, we have about 1000 numerical features, mainly of the following types: 1) "historical score info", e.g. the historical mean $w(x,\cdot)$ up to the point where we use the feature; 2) 0/1 indicator features for whether object $x$ has a particular attribute, etc.
- For each object, we have a text which describes the object (the description is not always reliable, but still useful).
Clearly, when predicting a score for a pair $(x,y)$, we can use features for both $x$ and $y$.
We are currently using the following setup (I omit validation/testing):
- For texts, we compute their BERT embeddings and then produce a feature based on the similarity between the embedding vectors (e.g. their cosine similarity).
- We split the dataset into fine-tuning and training datasets. The fine-tuning dataset may be empty, meaning no fine-tuning is performed.
- Using the fine-tuning dataset, we fine-tune BERT embeddings.
- Using the training dataset, we train decision trees to predict the scores.
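The similarity feature in the first bullet can be sketched as follows (a minimal sketch; the function name `cosine_feature` and the random tensors standing in for BERT outputs are illustrative, not from the actual pipeline):

```python
import torch
import torch.nn.functional as F

def cosine_feature(emb_x: torch.Tensor, emb_y: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two batches of BERT embeddings.

    emb_x, emb_y: (batch, 768) tensors, e.g. pooled BERT outputs.
    Returns a (batch,) tensor of similarities in [-1, 1] that can be
    appended to the tabular features before training the tree model.
    """
    return F.cosine_similarity(emb_x, emb_y, dim=1)

# Toy example with random tensors standing in for BERT embeddings.
emb_x = torch.randn(4, 768)
emb_y = torch.randn(4, 768)
sim = cosine_feature(emb_x, emb_y)
```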
We compare the following approaches:
- Without BERT features.
- Using BERT features, but without fine-tuning. There is some reasonable improvement in prediction accuracy.
- Using BERT features, with fine-tuning. The additional improvement is very small (although predictions made from the BERT features alone did improve, of course).
Question: Is there something simple I’m missing in this approach? E.g. maybe there are better ways to use texts? Other ways to use embeddings? Better approaches compared with decision trees?
I have tried several other things, without success. The approaches I expected to provide improvements are the following:
Fine-tune embeddings to predict the difference between $w(x,y)$ and the mean $w(x,\cdot)$. The motivation is that we already have a feature "mean $w(x,\cdot)$", which is a baseline for an object $x$, and we are interested in the deviation from this mean.
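Constructing that residual target is straightforward; here is a minimal sketch (the array names are hypothetical, and in practice the per-object mean must be computed only from scores observed before the prediction point to avoid leakage):

```python
import numpy as np

# Hypothetical data: object ids for x, and observed scores w(x, y).
x_ids = np.array([0, 0, 1, 1, 1])
w = np.array([0.2, 0.4, 1.0, 0.8, 0.6])

# Historical mean w(x, ·) per object x (computed on the same data here
# purely for illustration; use only past scores in the real pipeline).
means = np.zeros(x_ids.max() + 1)
for i in range(len(means)):
    means[i] = w[x_ids == i].mean()

# Residual fine-tuning target: deviation from the object's baseline.
residual = w - means[x_ids]
```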
Use a NN instead of decision trees. Namely, I use a few dense layers to turn the embedding vectors into features, like this:
```python
nn.Sequential(
    nn.Linear(768 * 2, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
    nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 10), nn.BatchNorm1d(10), nn.ReLU(),
)
```
After that, I combine these new $10$ features with the $2000$ features I already have, and use a similar architecture on top of them:
```python
nn.Sequential(
    nn.Linear(10 + n_features, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
    nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 1),
)
```
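For reference, the two stages can be wired into a single module like this (a sketch only; the class name `TwoTowerRegressor` is mine, and `n_features = 2000` is assumed from the feature counts above):

```python
import torch
import torch.nn as nn

class TwoTowerRegressor(nn.Module):
    """Embedding tower -> 10 features, concatenated with tabular features."""

    def __init__(self, emb_dim: int = 768, n_features: int = 2000):
        super().__init__()
        self.embedding_tower = nn.Sequential(
            nn.Linear(emb_dim * 2, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
            nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
            nn.Linear(100, 10), nn.BatchNorm1d(10), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(10 + n_features, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
            nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, emb_x, emb_y, tabular):
        emb_feats = self.embedding_tower(torch.cat([emb_x, emb_y], dim=1))
        return self.head(torch.cat([emb_feats, tabular], dim=1)).squeeze(1)

model = TwoTowerRegressor()
model.eval()  # BatchNorm in eval mode for a deterministic forward pass
out = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 2000))
```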
But as a result, the predictions are much worse than those of the decision trees. Are there better architectures suited to my case?