#StackBounty: #neural-network #regression #lstm #rnn #word-embeddings Understanding output of LSTM for regression

Bounty: 50

I am working with embeddings and wanted to see how feasible it is to predict some scores attached to some sequences of words. The details of the scores are not important.

Input (tokenized sentence): ('the', 'dog', 'ate', 'the', 'apple')
Output (float): 0.25

I have been following this tutorial, which tries to predict part-of-speech tags for such input. In that case, the output of the system is a distribution over all possible tags for each token in the sequence, e.g. for three possible POS classes {'DET': 0, 'NN': 1, 'V': 2}, the output for ('the', 'dog', 'ate', 'the', 'apple') could be

tensor([[-0.0858, -2.9355, -3.5374],
        [-5.2313, -0.0234, -4.0314],
        [-3.9098, -4.1279, -0.0368],
        [-0.0187, -4.7809, -4.5960],
        [-5.8170, -0.0183, -4.1879]])

Each row corresponds to a token, and the index of the highest value in a row is the predicted POS tag for that token.
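To make the row-wise reading concrete, here is a minimal sketch that takes the scores above and picks the best tag per token. The names tag_scores and ix_to_tag are only illustrative and not part of the tutorial.

```python
import torch

# Scores copied from the example above: one row per token, one column per tag.
tag_scores = torch.tensor([[-0.0858, -2.9355, -3.5374],
                           [-5.2313, -0.0234, -4.0314],
                           [-3.9098, -4.1279, -0.0368],
                           [-0.0187, -4.7809, -4.5960],
                           [-5.8170, -0.0183, -4.1879]])

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
# Hypothetical helper mapping: invert tag_to_ix so indices can be turned back into tags.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# argmax over dim=1 picks, for every row (token), the column (tag) with the highest score.
best_ix = tag_scores.argmax(dim=1)
predicted_tags = [ix_to_tag[i] for i in best_ix.tolist()]
print(predicted_tags)  # ['DET', 'NN', 'V', 'DET', 'NN']
```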

I understand this example relatively well, so I wanted to adapt it to a regression problem. The full code is below, but I am trying to make sense of the output.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class LSTMRegressor(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(LSTMRegressor, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to a single output
        self.linear = nn.Linear(hidden_dim, 1)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)

        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        regression = F.relu(self.linear(lstm_out.view(len(sentence), -1)))

        return regression

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# ================================================

training_data = [
    ("the dog ate the apple".split(), 0.25),
    ("everybody read that book".split(), 0.78)
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# ================================================


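# NOTE: EMBEDDING_DIM and HIDDEN_DIM are not defined anywhere in this snippet;
# the values below are placeholders so that the code runs.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMRegressor(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix))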
loss_function = nn.MSELoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

# See what the results are before training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)
    print(regr)


for epoch in range(100):  # again, normally you would NOT do 100 epochs, it is toy data
    for sentence, target in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        target = torch.tensor(target, dtype=torch.float)

        # Step 3. Run our forward pass.
        score = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(score, target)
        loss.backward()
        optimizer.step()

# See what the results are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)
    print(regr)


The output is:

# Before training
# After training

But I don’t understand why. I was expecting a single output, yet the size of the tensor is the same as the number of tokens in the input. I would guess, then, that the hidden state is given for each step in the input. Is that correct? Does that mean that the last item in the tensor (tensor[-1], or is it the first, tensor[0]?) is the final prediction? Why are all outputs given? Or does my misunderstanding lie earlier, in the forward pass? Perhaps I should only feed the last item of the LSTM output to the linear layer?
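For reference, a minimal shape check of what nn.LSTM returns for a single-layer, unidirectional model like the one above. It is only a sketch: the sizes and the random input are made up, and the names per_token and last_only are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
seq_len, embedding_dim, hidden_dim = 5, 6, 4

lstm = nn.LSTM(embedding_dim, hidden_dim)
linear = nn.Linear(hidden_dim, 1)

# Stand-in for the embedded sentence: (seq_len, batch=1, embedding_dim).
embeds = torch.randn(seq_len, 1, embedding_dim)

# lstm_out holds the hidden state at EVERY time step;
# h_n holds only the hidden state after the last time step.
lstm_out, (h_n, c_n) = lstm(embeds)
print(lstm_out.shape)  # torch.Size([5, 1, 4])
print(h_n.shape)       # torch.Size([1, 1, 4])

# For one layer and one direction, the last row of lstm_out is h_n.
print(torch.allclose(lstm_out[-1], h_n[-1]))  # True

# Feeding all time steps to the linear layer gives one value per token ...
per_token = linear(lstm_out.view(seq_len, -1))
print(per_token.shape)  # torch.Size([5, 1])

# ... whereas feeding only the last time step gives one value per sentence.
last_only = linear(lstm_out[-1])
print(last_only.shape)  # torch.Size([1, 1])
```

In this setting the last row of the per-token result is the same number as the single value computed from the last time step, since both come from the hidden state after the whole sentence has been read.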

I am also interested to know how this extrapolates to bidirectional LSTMs and multilayer LSTMs, and even how this would work with GRUs (bidirectional or not).

The bounty will be given to the person who can explain why we would use the last output or the last hidden state, and what the difference means from a goal-directed perspective. In addition, some information about multilayer architectures and bidirectional RNNs is welcome. For instance, is it common practice to sum or concatenate the output and hidden state of a bidirectional LSTM/GRU to get your data into a sensible shape? If so, how do you do it?
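As a starting point for the bidirectional and multilayer part, here is a sketch of the shapes PyTorch returns and of one possible way (concatenation) to build a fixed-size sentence vector. The sizes are made up, the names fwd_last, bwd_last, summary and gru_summary are illustrative, and concatenating is just one option, not necessarily the standard practice being asked about.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
seq_len, batch, input_dim, hidden_dim, num_layers = 5, 1, 6, 4, 2
x = torch.randn(seq_len, batch, input_dim)

lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
out, (h_n, c_n) = lstm(x)

# out: top layer only, both directions concatenated on the last axis.
print(out.shape)  # torch.Size([5, 1, 8])
# h_n: final state of every layer and direction, stacked on the first axis.
print(h_n.shape)  # torch.Size([4, 1, 4])

# The last two entries of h_n belong to the top layer: forward, then backward.
fwd_last = h_n[-2]  # forward direction, after reading the last token
bwd_last = h_n[-1]  # backward direction, after reading the first token

# The forward final state sits at the END of out, the backward one at the START,
# so out[-1] alone does not contain a backward state that has seen the whole sentence.
print(torch.allclose(out[-1, :, :hidden_dim], fwd_last))  # True
print(torch.allclose(out[0, :, hidden_dim:], bwd_last))   # True

# One option: concatenate both final states into a single sentence vector.
summary = torch.cat([fwd_last, bwd_last], dim=1)
print(summary.shape)  # torch.Size([1, 8])

# A GRU has the same layout, except that it returns h_n alone (no cell state).
gru = nn.GRU(input_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
gru_out, gru_h_n = gru(x)
gru_summary = torch.cat([gru_h_n[-2], gru_h_n[-1]], dim=1)
print(gru_summary.shape)  # torch.Size([1, 8])
```

A linear layer applied to such a vector would need 2 * hidden_dim input features instead of hidden_dim.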
