#StackBounty: #neural-network #reinforcement-learning Temporal difference learning with a neural network

Bounty: 100

Suppose I want to train a value network $v$ via TD(0). My TD target for time step $t$ is then $R_{t+1} + \gamma v(s_{t+1})$. If I understand correctly, I just need to minimize the mean squared error so that $v(s_t)$ moves closer to this target. But my network outputs values in $(-1, 1)$, and the rewards are also from this interval, so the TD target lies in $(-2, 2)$. Should I scale the target before applying the learning update? What are the consequences of not doing this, i.e. of training a neural network on target values from a broader interval than its output range? Can we say anything about this from a theoretical point of view?
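To make the range mismatch concrete, here is a minimal sketch in plain Python. The function names (`td_target`, `scaled_td_target`, `squared_td_error`) and the numbers are illustrative assumptions, not part of any library. The rescaling shown is only one possible option: it multiplies the reward by $(1 - \gamma)$, so the target becomes a convex combination of two numbers in $(-1, 1)$ and therefore stays in $(-1, 1)$. The network then estimates $(1 - \gamma)$ times the usual value, which assumes $\gamma < 1$.

```python
def td_target(reward, next_value, gamma):
    """Plain TD(0) target: R_{t+1} + gamma * v(s_{t+1})."""
    return reward + gamma * next_value


def scaled_td_target(reward, next_scaled_value, gamma):
    """Hypothetical rescaled target: (1 - gamma) * R_{t+1} + gamma * u(s_{t+1}).

    Here u stands for (1 - gamma) * v. With reward and u both in (-1, 1),
    the result is a convex combination and so also lies in (-1, 1).
    """
    return (1.0 - gamma) * reward + gamma * next_scaled_value


def squared_td_error(value, target):
    """Squared error for one transition; the target is treated as a constant."""
    return 0.5 * (value - target) ** 2


# Illustrative numbers (assumed, not from the question).
gamma = 0.9
reward = 0.95
next_value = 0.95

plain = td_target(reward, next_value, gamma)           # about 1.805, outside (-1, 1)
scaled = scaled_td_target(reward, next_value, gamma)   # about 0.95, inside (-1, 1)

# An output bounded by 1 cannot reach the plain target, so the loss on this
# transition stays above a positive floor however long training runs.
loss_floor = squared_td_error(1.0, plain)

print(plain, scaled, loss_floor)
```

With the plain target, a bounded output can never drive the error on such a transition to zero, which is the situation the question is asking about. Whether that matters in practice, and whether this particular rescaling is the right fix, is exactly what remains open here.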

