#StackBounty: #neural-network #regression #lstm #rnn #word-embeddings Understanding output of LSTM for regression

Bounty: 50

I am working with embeddings and wanted to see how feasible it is to predict some scores attached to some sequences of words. The details of the scores are not important.

Input (tokenized sentence): ('the', 'dog', 'ate', 'the', 'apple')
Output (float): 0.25

I have been following this tutorial, which tries to predict part-of-speech tags for such input. In that case, the output of the system is a distribution over all possible tags for each token in the sequence, e.g. for three possible POS classes {'DET': 0, 'NN': 1, 'V': 2}, the output for ('the', 'dog', 'ate', 'the', 'apple') could be

tensor([[-0.0858, -2.9355, -3.5374],
        [-5.2313, -0.0234, -4.0314],
        [-3.9098, -4.1279, -0.0368],
        [-0.0187, -4.7809, -4.5960],
        [-5.8170, -0.0183, -4.1879]])

Each row corresponds to a token; the index of the highest value in a row is the predicted POS tag for that token.
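To make the row-wise reading concrete, here is a minimal sketch that takes the example tensor above and picks the highest-scoring tag per token. The names `tag_scores` and `ix_to_tag` are illustrative only and are not taken from the tutorial.

```python
import torch

# Scores copied from the example output above (one row per token).
tag_scores = torch.tensor([[-0.0858, -2.9355, -3.5374],
                           [-5.2313, -0.0234, -4.0314],
                           [-3.9098, -4.1279, -0.0368],
                           [-0.0187, -4.7809, -4.5960],
                           [-5.8170, -0.0183, -4.1879]])

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
# Hypothetical helper mapping: class index back to tag name.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# argmax over the class dimension gives the best tag index for each token.
best_ix = tag_scores.argmax(dim=1).tolist()
predicted_tags = [ix_to_tag[i] for i in best_ix]
print(predicted_tags)  # ['DET', 'NN', 'V', 'DET', 'NN']
```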

I understand this example relatively well, so I wanted to adapt it to a regression problem. The full code is below, but I am trying to make sense of the output.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


class LSTMRegressor(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(LSTMRegressor, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to a single output
        self.linear = nn.Linear(hidden_dim, 1)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)

        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        regression = F.relu(self.linear(lstm_out.view(len(sentence), -1)))

        return regression


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# ================================================

training_data = [
    ("the dog ate the apple".split(), 0.25),
    ("everybody read that book".split(), 0.78)
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# ================================================

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMRegressor(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix))
loss_function = nn.MSELoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))

# See what the results are before training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)

    print(regr)

for epoch in range(100):  # again, normally you would NOT do 100 epochs, it is toy data
    for sentence, target in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        target = torch.tensor(target, dtype=torch.float)

        # Step 3. Run our forward pass.
        score = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(score, target)
        loss.backward()
        optimizer.step()

# See what the results are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    regr = model(inputs)

    print(regr)

The output is:

# Before training
tensor([[0.0000],
        [0.0752],
        [0.1033],
        [0.0088],
        [0.1178]])
# After training
tensor([[0.6181],
        [0.4987],
        [0.3784],
        [0.4052],
        [0.4311]])

But I don't understand why. I was expecting a single output, yet the size of the tensor equals the number of tokens in the input. My guess is that the hidden state is returned for each step of the input. Is that correct? Does that mean that the last item in the tensor (tensor[-1], or is it the first, tensor[0]?) is the final prediction? Why are all outputs given? Or does my misunderstanding lie earlier, in the forward pass? Perhaps I should feed only the last item of the LSTM layer's output to the linear layer?
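For reference, the shapes involved can be checked in isolation. The sketch below is not the code above; it uses random vectors as a stand-in for the embedded sentence and assumes a single-layer, unidirectional `nn.LSTM` like the one in the model. It shows that the first return value holds one hidden state per time step, while `h_n` holds only the final one.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

seq_len, batch, emb_dim, hidden_dim = 5, 1, 6, 6
lstm = nn.LSTM(emb_dim, hidden_dim)  # one layer, unidirectional
linear = nn.Linear(hidden_dim, 1)

# Random stand-in for the embedded tokens (an assumption for illustration).
embeds = torch.randn(seq_len, batch, emb_dim)
lstm_out, (h_n, c_n) = lstm(embeds)

# lstm_out: hidden state at every time step -> (seq_len, batch, hidden_dim)
print(lstm_out.shape)  # torch.Size([5, 1, 6])
# h_n: final hidden state only -> (num_layers, batch, hidden_dim)
print(h_n.shape)       # torch.Size([1, 1, 6])

# For this configuration the last step of lstm_out is the final hidden state.
same_last_state = torch.allclose(lstm_out[-1], h_n[-1])

# Linear layer on every step: one value per token, shape (5, 1).
per_token = linear(lstm_out.view(seq_len, -1))
# Linear layer on the last step only: one value per sentence, shape (1, 1).
per_sentence = linear(lstm_out[-1])
```

Under these assumptions, `per_token` has the same shape as the output printed above, and `per_sentence` is the single-value alternative mentioned in the last question.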

I am also interested to know how this extrapolates to bidirectional LSTMs and multilayer LSTMs, and even how this would work with GRUs (bidirectional or not).

The bounty will be given to the person who can explain why we would use the last output or the last hidden state, or what the difference means from a goal-directed perspective. In addition, some information about multilayer architectures and bidirectional RNNs is welcome. For instance, is it common practice to sum or concatenate the output and hidden state of a bidirectional LSTM/GRU to get your data into a sensible shape? If so, how do you do it?
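As a starting point for the bidirectional and multilayer part, the sketch below only inspects what PyTorch returns in that setting; it does not claim which pooling strategy is best. The sizes, the random input and the names `sentence_repr` and `regressor` are assumptions for illustration. Concatenating the two final states of the top layer is shown as one possible option.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

seq_len, batch, emb_dim, hidden_dim, num_layers = 5, 1, 6, 6, 2
x = torch.randn(seq_len, batch, emb_dim)  # stand-in for embedded tokens

bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
out, (h_n, c_n) = bilstm(x)
# out: top layer only, both directions concatenated -> (seq_len, batch, 2 * hidden_dim)
# h_n: final state of every layer and direction -> (num_layers * 2, batch, hidden_dim)

# For the top layer, h_n[-2] is the forward direction and h_n[-1] the backward one.
fwd_final, bwd_final = h_n[-2], h_n[-1]

# The forward pass ends at the last token, the backward pass at the first token.
fwd_matches = torch.allclose(out[-1, :, :hidden_dim], fwd_final)
bwd_matches = torch.allclose(out[0, :, hidden_dim:], bwd_final)

# One option: concatenate both final states and regress on the result.
sentence_repr = torch.cat([fwd_final, bwd_final], dim=1)  # (batch, 2 * hidden_dim)
regressor = nn.Linear(2 * hidden_dim, 1)
score = regressor(sentence_repr)                          # (batch, 1)

# A GRU has the same layout but returns no cell state.
bigru = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
gru_out, gru_h_n = bigru(x)
gru_repr = torch.cat([gru_h_n[-2], gru_h_n[-1]], dim=1)   # (batch, 2 * hidden_dim)
```

Note that in the bidirectional case `out[-1]` mixes the final forward state with the backward state after only one step, which is why the sketch reads both final states from `h_n` instead.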



#StackBounty: #neural-network #statistics #recurrent-neural-net #forecast #forecasting Is an Arma model equivalent to a 1-layer Recurre…

Bounty: 50

Given a time series $f(t)$ to forecast, let us consider an ARMA model of the form:
$$
f(t) = c + \sum_{i=1}^p a_i f(t-i) + e(t) + \sum_{j=1}^q b_j e(t-j)
$$

where $e(t)$ are the forecast errors.

On the training set, if $f(t)$ is the ground truth, then we define its estimate obtained with this model as $\widetilde{f}(t) = f(t) + e(t)$.

Let $m = \min(p,q)$. We can then rewrite the first equation as:
$$
\widetilde{f}(t) = c + \sum_{i=1}^m (a_i + b_i) f(t-i) + \sum_{i=m+1}^p a_i f(t-i) - \sum_{j=1}^q b_j \widetilde{f}(t-j)
$$

After reparametrization, this can be rewritten as:
$$
\widetilde{f}(t) = c + \sum_{i=1}^k c_i f(t-i) - \sum_{j=1}^q b_j \widetilde{f}(t-j)
$$

This is the equation of a 1-layer recurrent neural network (RNN) without an activation function.

So, are ARMA models a subset of RNNs, or is there a flaw in this reasoning?
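To make the recursion in the last equation concrete, it can be sketched as a plain loop in which the past estimates play the role of the recurrent state. The function name `linear_recurrent_forecast`, its arguments and the choice to treat values before the start of the series as zero are all assumptions made for illustration; this is not a fitted ARMA model and does not settle the equivalence question.

```python
def linear_recurrent_forecast(f, c, c_coefs, b_coefs):
    """One-step estimates from the reparametrized recursion.

    f        observed series f(0), f(1), ...
    c        constant term
    c_coefs  weights c_1..c_k on past observations ("input" weights)
    b_coefs  weights b_1..b_q on past estimates ("recurrent" weights)
    """
    k, q = len(c_coefs), len(b_coefs)
    estimates = []
    for t in range(len(f)):
        value = c
        # Contribution of past observations f(t-i).
        for i in range(1, k + 1):
            if t - i >= 0:  # assumption: missing history counts as zero
                value += c_coefs[i - 1] * f[t - i]
        # Contribution of past estimates, entering with a minus sign.
        for j in range(1, q + 1):
            if t - j >= 0:
                value -= b_coefs[j - 1] * estimates[t - j]
        estimates.append(value)
    return estimates


# Toy example with k = q = 1 (values chosen arbitrarily).
estimates = linear_recurrent_forecast([1.0, 2.0, 3.0], c=0.5,
                                      c_coefs=[1.0], b_coefs=[0.5])
print(estimates)  # [0.5, 1.25, 1.875]
```

Each estimate depends linearly on the current inputs and on the previous estimates, with no nonlinearity applied, which is the structure the question compares to an RNN layer.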

