#StackBounty: #neural-networks #reinforcement-learning #tensorflow #q-learning As epsilon decays, rewards get worse during exploitation

Bounty: 50

I am currently trying to write a learning agent from the "Human-level control through deep reinforcement learning" paper in TensorFlow 2.0. I’ve copied the recommended hyperparameters and picked the easiest environment possible. There has to be an error in my model, because the rewards get worse the further epsilon decays.
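
For reference, the paper anneals epsilon linearly from 1.0 to 0.1 over the first million frames and keeps it fixed afterwards. A minimal sketch of that kind of epsilon-greedy action selection (the helper names epsilon_schedule and select_action are just for illustration, not part of my code):

import numpy as np

def epsilon_schedule(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    # linear annealing from eps_start to eps_end, then constant
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(model, state, step, num_actions):
    # epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.rand() < epsilon_schedule(step):
        return np.random.randint(num_actions)
    q_values = model.predict(state[np.newaxis, ...], verbose=0)
    return int(np.argmax(q_values[0]))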

My model is identical to the one described in the paper (at least, I think it is).

import tensorflow as tf
from tensorflow import keras

def build_model(self, agent_history_length):
    # input: a stack of 4 greyscaled 84x84 frames
    inputs = keras.Input(shape=(agent_history_length, 84, 84))
    l1 = keras.layers.Conv2D(
        filters=32,
        kernel_size=(8, 8),
        strides=(4, 4),
        activation=keras.activations.relu,
        name="cnn_1",
        padding="same",
    )(inputs)
    l2 = keras.layers.Conv2D(
        filters=64,
        kernel_size=(4, 4),
        strides=(2, 2),
        activation=keras.activations.relu,
        name="cnn_2",
        padding="same",
    )(l1)
    l3 = keras.layers.Conv2D(
        filters=64,
        kernel_size=(3, 3),
        strides=(1, 1),
        activation=keras.activations.relu,
        name="cnn_3",
        padding="same",
    )(l2)
    flatten = keras.layers.Flatten()(l3)
    l4 = keras.layers.Dense(
        512,
        activation=keras.activations.relu,
        name="dense_1"
    )(flatten)
    # one output per action
    outputs = keras.layers.Dense(
        self.num_actions,
        activation=keras.activations.relu,
        name="output"
    )(l4)
    # RMSprop, as used in "Human-level control through deep reinforcement learning"
    optimizer = keras.optimizers.RMSprop(learning_rate=self.learning_rate, rho=.99, momentum=.95)
    model = keras.Model(inputs=inputs, outputs=outputs)
    self.model = model
    self.model.compile(optimizer=optimizer, loss=tf.keras.losses.MeanSquaredError())
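
For comparison, the network in the paper ends in a fully connected linear layer with one output unit per valid action (Q-values are unbounded, so there is no activation on the head), and most DQN implementations use "valid" padding on the convolutions. A sketch of that output layer, reusing l4 from above (num_actions stands in for self.num_actions):

# sketch only: the paper's Q-value head, one linear output per valid action
paper_outputs = keras.layers.Dense(
    num_actions,
    activation="linear",   # no squashing of the Q-values
    name="q_values"
)(l4)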

My training function should also match the one described in the paper.

import numpy as np

def learn(self):
    # sample a minibatch of 32 transitions (s, a, r, s', done) from replay memory
    s, a, r, s_, d = self.experience_memory.pop()
    if s is None:
        return

    # Q-values of the current states, from the online network
    q_vals = self.network.model.predict(s)

    # Q-values of the next states, from the target network
    q_vals_next = self.target_network.model.predict(s_)

    # y-values (training targets)
    q_target = q_vals_next.copy()

    batch_index = np.arange(batch_size)

    # Bellman update for the actions that were actually taken
    q_target[batch_index, a] = r + gamma * np.max(q_vals_next) * (1 - d)

    self.network.model.fit(s, q_target, verbose=0)

    # copy the online weights into the target network every 10 episodes
    if self.current_step % (episode_steps * 10) == 0:
        self.target_network.model.set_weights(self.network.model.get_weights())
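
For reference, the update in the paper sets, for every transition j in the minibatch, y_j = r_j if the episode terminates at step j+1 and y_j = r_j + gamma * max_a' Q_target(s'_j, a') otherwise, with the maximum taken per sample; the target network is synced every C parameter updates (10,000 in the Nature hyperparameter table) rather than per episode. A vectorised sketch of that target computation, using the same array names as above:

# sketch only: per-sample DQN targets, assuming r, d and a are 1-D arrays of
# length batch_size and q_vals / q_vals_next have shape (batch_size, num_actions)
targets = q_vals.copy()                       # non-taken actions then contribute zero loss under MSE
max_next = np.max(q_vals_next, axis=1)        # max over actions, separately per sample
targets[np.arange(batch_size), a] = r + gamma * max_next * (1 - d)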

Something is off and I can’t find the cause. I’d be very grateful if someone could point me in the right direction.

