I’m trying to do sequence classification with an LSTM. The setup is that we get data from a machine as our input. We don’t know what state the machine is in, so we perform classification to recover the state. For training purposes, a state can persist for a long period of time, so my y vector can look something like
y = [0, 0, 0, 0, ...., 0, 0, 1, 1, 1, 1, 1, ...]
My network is very simple:
```python
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        # note: nn.LSTM takes hidden_size, not hidden_units
        self.lstm = nn.LSTM(input_size=7, hidden_size=7)
        self.classifier = nn.Linear(7, 5)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, hidden):
        x, hidden = self.lstm(x, hidden)
        return self.softmax(self.classifier(x)), hidden
```
When I shuffle all my data observations, the LSTM works, but not as well as an MLP. So I was hoping to exploit the LSTM’s inherent ability to carry hidden state and train without shuffling my data. However, when I try this, my gradients vanish and almost all of my network weights go to zero. This happens right after the first batch, because the y vector for the first batch looks just like
[0, 0, 0, 0, 0, 0, ..., 0]
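For concreteness, here is a minimal, runnable sketch of what I’m attempting: carrying the hidden state across unshuffled batches and detaching it between batches so each backward pass stays within one batch. The random data, seed, optimizer, and NLL-on-log-probabilities loss are placeholders I made up for this repro, not my real pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.lstm = nn.LSTM(input_size=7, hidden_size=7)
        self.classifier = nn.Linear(7, 5)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, hidden):
        x, hidden = self.lstm(x, hidden)
        return self.softmax(self.classifier(x)), hidden

torch.manual_seed(0)
net = Network()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

seq_len, batch = 50, 1
# initial hidden state: (h_0, c_0), each (num_layers, batch, hidden_size)
hidden = (torch.zeros(1, batch, 7), torch.zeros(1, batch, 7))

for step in range(3):  # a few consecutive, unshuffled batches
    x = torch.randn(seq_len, batch, 7)                   # placeholder features
    y = torch.zeros(seq_len * batch, dtype=torch.long)   # early batches are all class 0
    probs, hidden = net(x, hidden)
    # model outputs probabilities, so take log for NLL (placeholder loss choice)
    loss = F.nll_loss(probs.reshape(-1, 5).log(), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # detach so the next batch doesn't backprop through this one
    hidden = tuple(h.detach() for h in hidden)
```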
Increasing the batch size to something very large (like 8192) improves the situation slightly, but I still barely break an F1 score of 0.5. Changing the network configuration doesn’t help.
My main questions are: what is causing my gradients to vanish (I thought LSTMs were supposed to prevent this)? And how do I train on my problem while maintaining state across batches without my gradients vanishing?