Suppose I want to train a value network $$v$$ via TD(0). The TD target for time step $$t$$ is then $$R_{t+1} + \gamma v(s_{t+1})$$. If I understand correctly, I just need to minimize the mean squared error so that $$v(s_t)$$ moves closer to this target. However, my network outputs values in $$(-1, 1)$$, and the rewards also lie in this interval, so the TD target lies in $$(-2, 2)$$. Should I scale the target before training? What are the consequences of not doing so, i.e. of training a neural network on target values from a broader interval than its output range? Can anything be said about this from a theoretical point of view?
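To make the range issue concrete, here is a minimal sketch of the arithmetic, assuming a discount factor of 0.9 and a network whose output is squashed into $$(-1, 1)$$ (e.g. by a tanh). The function names and the particular rescaling, which weights the reward by $$1 - \gamma$$, are my own illustrative choices and only one possible way to scale; I am not claiming it is the standard one.

```python
import numpy as np

def td0_target(reward, next_value, gamma):
    """Plain TD(0) target: R_{t+1} + gamma * v(s_{t+1})."""
    return reward + gamma * next_value

def scaled_td0_target(reward, next_value, gamma):
    """Hypothetical rescaling (an assumption, not from any library):
    (1 - gamma) * R_{t+1} + gamma * v(s_{t+1}).
    This is a convex combination of two numbers in (-1, 1),
    so it stays in (-1, 1) as well."""
    return (1.0 - gamma) * reward + gamma * next_value

def mse(prediction, target):
    """Mean squared error between predictions and targets."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

gamma = 0.9  # assumed discount factor

# Near-extreme case: reward and bootstrapped value both close to +1.
reward, next_value = 0.99, 0.99
plain = td0_target(reward, next_value, gamma)
scaled = scaled_td0_target(reward, next_value, gamma)

# A bounded output can at best approach 1, so for a target above 1
# the squared error cannot drop below roughly (target - 1)^2.
best_reachable_output = 1.0
loss_floor = mse([best_reachable_output], [plain])

print("plain target: ", plain)
print("scaled target:", scaled)
print("loss floor for the plain target:", loss_floor)
```

With these numbers the plain target falls outside $$(-1, 1)$$, so the squared error for that sample can never reach zero, while the rescaled target stays inside the output range. Whether rescaling the reward like this is acceptable for a given task is part of what I am asking.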

