All code contained in this post can be found on this site’s Github repository here.

A policy gradient-based reinforcement learning method selects agent actions based on the output of a neural network, with each output corresponding to the probability that a certain action should be taken. This probability distribution is then sampled to produce actions during training.

The A2C algorithm, within the family of policy gradient-based reinforcement learning methods, has a policy gradient term equal to:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} \log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$$

It should be noted that the above term is for gradient *ascent*, not descent, i.e. this term should be maximized. Recall that the advantage is expressed as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

Which is a kind of “relative” or normalized benefit of taking action $a_t$ from state $s_t$, with the normalization arising from the overall value of being in that state $V(s_t)$ (for further discussion on the advantage and value estimates, see this post). This loss weights the log probability of taking the action with the advantage. The probability of taking the action is generated from the policy output of the neural network, with a softmax activation applied. So if an action is taken with high probability, let’s say 0.8, and this results in a high advantage – then this policy output will be reinforced. If the advantage is low, alternatively, then this high probability policy output will be discouraged.

Via Bellman’s equation, the advantage can actually be estimated purely from the value estimate:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
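As a concrete sketch of this estimate, with made-up reward and value numbers (the function name and values here are illustrative only):

```python
def td_advantage(reward, value_next, value_current, gamma=0.99):
    # A(s_t, a_t) = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    return reward + gamma * value_next - value_current

# hypothetical critic outputs for two consecutive states
adv = td_advantage(reward=1.0, value_next=10.0, value_current=10.5)
# adv = 1.0 + 0.99 * 10.0 - 10.5 = 0.4 (up to floating point)
```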

This enables us to create an advantage-weighted policy gradient method by also estimating the value of any given state $V(s_t)$. This requires a neural network with two outputs: one estimating action probabilities (the actor part) and the other estimating state values (the critic part), with architecture as shown below:

The architecture shown below is for a network whose state inputs are game pixels. For other state types, such as the state variables for the Cartpole environment, the input layers are usually densely connected. The loss function for the value output in the A2C architecture is usually simply the mean-squared error between the predicted values and the actual values calculated from the discounted future rewards.

**What’s the disadvantage with the A2C method?**

There are a couple of disadvantages of the A2C method. These can be summarised as follows:

- Large gradient steps can move the network weights into sub-optimal areas and wreck (or at least slow) the training progress
- There is poor sample efficiency – the batch samples can only inform the network training once

These disadvantages were addressed in the Trust Region Policy Optimization (TRPO) method, which ultimately leads to the PPO method, and will be discussed in the following section in that context.

There are some very interesting ideas in TRPO, however, in this post, I will only be covering them briefly without too much theory.

The first problem to address is the poor sample efficiency of A2C (and other Policy Gradient methods). A set of training samples is collected by playing the game (which in reinforcement learning can be a relatively lengthy process), and these samples are used to train the network only *once*. This is not great from a computational perspective, especially given that reinforcement learning may require hundreds of thousands or millions of training steps to achieve good results. In policy gradient-based methods, ideally we would like to use samples generated via the rolling out of a trajectory using an older version of the policy, to train a new policy.

But how can we do this in a mathematically valid way? It turns out we can use the concept of Importance Sampling (IS). Importance sampling allows us to compute the expected value of function $f(x)$, where $x$ is sampled from a distribution $p(x)$, by instead sampling $x$ from another distribution $q(x)$. That’s probably a bit confusing, but this is what it looks like mathematically:

$$\mathbb{E}_{x \sim p(x)} \left[f(x)\right] = \mathbb{E}_{x \sim q(x)} \left[\frac{p(x)}{q(x)} f(x) \right]$$

Where $x \sim p(x)$ in the expectation subscript refers to $x$ being sampled from $p(x)$ and the same goes for $x \sim q(x)$. We can apply this formula to the case where we might want to use actions, states and advantage values sampled during the roll-out of an old policy $\pi_{\theta_{old}}$ to train the current policy. First recall that the policy gradient part of the loss for A2C, with the expectation still shown (before it is cashed out into an empirical sum sampled from a trajectory), looks like:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_{\theta} \log P_{\pi_{\theta}}(a_t|s_t) A(s_t, a_t)\right]$$

Now, using importance sampling, we can replace this with:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[\frac{\nabla_{\theta} P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta_{old}}}(a_t|s_t)} A_{\theta_{old}}(s_t, a_t)\right]$$

You may be wondering where the $\log$ went in the formula above; it turns out that the gradients of the two expressions are equivalent for small step sizes, as discussed in the original TRPO paper.

This use of importance sampling now allows the same batch of samples from a trajectory rollout to be used many times to train the current policy, making the policy gradient method much more sample efficient.
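The importance sampling identity itself can be checked numerically on a toy discrete distribution; the distributions and the function $f$ below are arbitrary choices for illustration:

```python
import random
random.seed(0)

# Check E_{x~p}[f(x)] = E_{x~q}[(p(x)/q(x)) f(x)] on a small discrete space.
xs = [0, 1, 2]
p = {0: 0.2, 1: 0.5, 2: 0.3}   # target distribution
q = {0: 0.4, 1: 0.4, 2: 0.2}   # sampling ("behavior") distribution
f = lambda x: x * x

# direct expectation under p
exact = sum(p[x] * f(x) for x in xs)

# Monte Carlo estimate: sample from q, re-weight by p(x)/q(x)
n = 200_000
samples = random.choices(xs, weights=[q[x] for x in xs], k=n)
is_estimate = sum(p[x] / q[x] * f(x) for x in samples) / n
```

With enough samples, `is_estimate` converges to `exact`, even though no sample was ever drawn from $p$; this is what justifies reusing roll-outs from the old policy to train the new one.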

As mentioned above, one of the issues of the A2C method of reinforcement learning is that it is unstable. In other words, the gradient steps during training can derail the agent’s learning process. Consider the image below of the Hillary Step on Mount Everest:

Imagine the training process of the agent is to ascend the very narrow path to the summit, which represents optimal agent behavior. However, as can be observed, the path is very narrow, and a step too large in any direction will send the agent “over the cliff” perhaps permanently damaging the training process for the agent. This is perhaps an extreme example, though maybe not – reinforcement learning optimization landscapes are infamously sharp and “treacherous”. In any case, this fact restricts both the speed of the training and the learning rates that are possible during training. If the learning rates or steps are too high, then there will be a greater risk of derailing the training process.

This is especially a problem if we are hoping to improve sample efficiency and run multiple training steps over the same set of trajectory samples. This is where the “trust region” part of TRPO comes in. Trust region optimization involves creating a small area around the current point which specifies the maximum radius of a step that can be taken in the next gradient descent/ascent iteration. Consider the two diagrams below, from this good overview of trust-region methods:

The dark circles around the points in the diagrams above are the trust regions, which define the boundaries of the maximum next step in whatever gradient method is being used. The trust region is either reduced or expanded depending on how well the last step performed: if it resulted in a significant improvement, the trust region size is expanded; if it resulted in a degradation in performance, the trust region size is reduced. However, the trust region size can also be a simple fixed hyperparameter, as is the case in TRPO, as will be shown below.
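As a sketch only (TRPO itself uses a fixed constraint radius, as noted above), the adaptive expand/shrink bookkeeping might look like the following, where the 0.25/0.75 thresholds and the growth/shrink factors are conventional but arbitrary choices:

```python
def update_trust_radius(radius, improvement_ratio,
                        shrink=0.5, grow=2.0, max_radius=1.0):
    # Generic trust-region bookkeeping (illustrative, not TRPO's fixed delta):
    # grow the region after a good step, shrink it after a poor one.
    # improvement_ratio = actual improvement / improvement predicted by the
    # local model of the objective.
    if improvement_ratio > 0.75:
        return min(radius * grow, max_radius)
    if improvement_ratio < 0.25:
        return radius * shrink
    return radius
```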

The optimization problem solved in TRPO is approximately as follows:

$$J(\theta) = \mathbb{E} \left[\frac{P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta_{old}}}(a_t|s_t)} A_{\theta_{old}}(s_t, a_t)\right]$$

With an optimization constraint of:

$$D_{KL}(\theta \| \theta_{old}) \leq \delta $$

Here $D_{KL}$ is the Kullback–Leibler divergence (KL divergence). The KL divergence is a way of measuring the distance between two probability distributions, so this constraint ensures that the new parameters of the policy aren’t too different from the old parameters. In other words, this constraint enforces a trust region on the parameter / policy updates.
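For discrete action distributions, the KL divergence is straightforward to compute. Here is a minimal sketch with made-up action probabilities and a hypothetical $\delta$:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), for discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old_policy = [0.5, 0.25, 0.15, 0.1]   # action probabilities under theta_old
new_policy = [0.48, 0.27, 0.15, 0.1]  # a small policy update

delta = 0.01  # hypothetical trust-region radius
within_trust_region = kl_divergence(new_policy, old_policy) <= delta
```

A small shift in probabilities gives a small KL value, so the constraint is satisfied; a large jump in the distribution would violate it.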

In the original paper, the conjugate gradient optimization method is used to solve this optimization problem with the aforementioned $D_{KL}$-based constraint. However, compared to the standard optimization methods commonly used in deep learning, the conjugate gradient method is slow. Therefore, an improvement on TRPO that removes the need for strict constraints and conjugate gradient optimization, while still keeping the trust region concept and the improved sample efficiency, would clearly be desirable. Proximal Policy Optimization (PPO) is one way of achieving these goals.

As shown above, the importance sampling version of the policy loss function, which enables better sample efficiency, is expressed as follows in the original PPO paper:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

Here the superscript *CPI* stands for “conservative policy iteration”, and the $\pi_\theta (a_t | s_t)$ notation expresses the same probability as shown previously (i.e. $P_{\pi_{\theta}}(a_t|s_t)$). This loss function, while allowing greater sample efficiency via importance sampling, as discussed above, is still subject to large gradient updates derailing training progress. In order to introduce a quasi-trust-region limitation, without introducing conjugate gradient-based optimization or strict constraints, the authors of the PPO paper introduce the following alternative loss function:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$$

Where $\epsilon$ is a hyperparameter, recommended to be between 0.1 and 0.3. This CLIP loss function is essentially the same as the CPI loss function, except when the probability of an action under the new parameters ($\pi_\theta (a_t | s_t)$) differs from that under the old parameters by more than a certain degree (i.e. $r > 1 + \epsilon$ or $r < 1 - \epsilon$), in which case the ratio is clipped. This ensures that no large steps can occur in the gradient updates, which essentially enforces a trust region. The plots below for both positive and negative advantage values illustrate the behavior of $L^{CLIP}$ with respect to $r$:
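The per-sample clipped term can also be checked numerically; this is a minimal pure-Python sketch (not the vectorized TensorFlow implementation used in the code later in this post):

```python
def l_clip_term(r, advantage, eps=0.2):
    # per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * advantage, clipped_r * advantage)

# positive advantage: the gain from increasing r is capped at (1 + eps) * A
capped = l_clip_term(1.5, 1.0)    # 1.2, not 1.5
# negative advantage: an over-large r still produces the full penalty
penalty = l_clip_term(1.5, -1.0)  # -1.5
```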

The total loss function for the PPO method consists of $L^{CLIP}$, the mean-square error loss of the value estimator, and a term to encourage higher entropy / greater exploration (with the latter two terms identical to the components in A2C). It can be expressed as follows, with signs introduced to convert a maximization function to a minimization function:

$$L_{TOTAL} = -L^{CLIP} + L^{VALUE} \cdot k_1 - L^{ENTROPY} \cdot k_2$$

Where the $k$ values are hyperparameters that weigh the various components of the loss against $L^{CLIP}$ (with $k_1$ usually on the order of 0.5, and $k_2$ on the order of 0.01). In the following section, I will present a code work-through of PPO being utilized to train a CartPole agent.

As stated above, all code in this walk-through can be found on this site’s Github repository. The RL environment that will be used is Open AI’s Cartpole environment. In this example, I’ll be making use of TensorFlow’s GradientTape functionality and the Keras model API.

The code below shows the establishment of some constants and the creation of the neural network model structure:

```python
import tensorflow as tf
from tensorflow import keras
import tensorflow_probability as tfp
import numpy as np
import gym
import datetime as dt

STORE_PATH = 'C:\\Users\\andre\\TensorBoard\\PPOCartpole'
CRITIC_LOSS_WEIGHT = 0.5
ENTROPY_LOSS_WEIGHT = 0.01
ENT_DISCOUNT_RATE = 0.995
BATCH_SIZE = 64
GAMMA = 0.99
CLIP_VALUE = 0.2
LR = 0.001
NUM_TRAIN_EPOCHS = 10

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

ent_discount_val = ENTROPY_LOSS_WEIGHT


class Model(keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.num_actions = num_actions
        self.dense1 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.value = keras.layers.Dense(1)
        self.policy_logits = keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.value(x), self.policy_logits(x)

    def action_value(self, state):
        value, logits = self.predict_on_batch(state)
        dist = tfp.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, value
```

Most of the constants declared in the first part of the code above are familiar from the A2C tutorial – the critic and entropy loss weight values, gamma value (reward discounting factor), and so on. However, a few new values need to be discussed. First, I have included an entropy discounting rate (ENT_DISCOUNT_RATE) – after each episode, the entropy loss weight value will be discounted by this value to discourage exploration as the number of episodes played piles up. The constant CLIP_VALUE refers to the value $\epsilon$ in the $L^{CLIP}$ loss function ($\mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$). The value NUM_TRAIN_EPOCHS refers to the number of times a given trajectory of rewards, actions, states, etc. will be used to train the network, making use of the importance sampling configuration of the loss function.

The Model class is a fairly standard Keras model API implementation, with two common dense layers and a separate value/critic output and a policy logits output. For more details on this model implementation, see the A2C tutorial.

Next, we’ll review the custom loss functions:

```python
def critic_loss(discounted_rewards, value_est):
    return tf.cast(
        tf.reduce_mean(keras.losses.mean_squared_error(discounted_rewards, value_est)) * CRITIC_LOSS_WEIGHT,
        tf.float32)
```

First the critic loss – this loss function calculates the mean squared error between the discounted rewards (extracted by discounting the trajectory of rewards collated by running the agent in the environment) and the predicted values (*value_est*) from the neural network from the trajectory of states.

Next, we look at the entropy loss:

```python
def entropy_loss(policy_logits, ent_discount_val):
    probs = tf.nn.softmax(policy_logits)
    entropy_loss = -tf.reduce_mean(keras.losses.categorical_crossentropy(probs, probs))
    return entropy_loss * ent_discount_val
```

In this loss function, the policy logits directly from the neural network (and based on the trajectory of states sampled from the environment) and the discounted entropy weight (*ent_discount_val*) are passed as arguments. The policy logits are then converted to quasi-probabilities by applying the softmax function, and then the entropy calculation is performed by applying the Keras categorical cross-entropy function on *probs* twice (again, for an explanation of how this works, see here). Notice the minus sign to invert the entropy calculation – in a minimization optimization calculation, this will act to enhance the entropy / exploration of the agent. The discounted entropy weight value is then applied to the entropy and the product is returned.
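The “cross-entropy of *probs* with *probs*” trick can be verified in plain Python: the cross-entropy of a distribution with itself reduces exactly to its entropy, $-\sum_i p_i \log p_i$ (the probabilities below are made-up softmax outputs):

```python
import math

def cross_entropy(targets, preds):
    # H(t, p) = -sum_i t_i * log(p_i)
    return -sum(t * math.log(p) for t, p in zip(targets, preds) if t > 0)

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for one state
entropy = cross_entropy(probs, probs)  # CE of a distribution with itself = its entropy
```

A uniform distribution has maximum entropy ($\log n$ for $n$ actions), so a peaked distribution like the one above yields a smaller value; maximizing entropy therefore pushes the policy toward more exploration.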

```python
def actor_loss(advantages, old_probs, action_inds, policy_logits):
    probs = tf.nn.softmax(policy_logits)
    new_probs = tf.gather_nd(probs, action_inds)
    ratio = new_probs / old_probs
    policy_loss = -tf.reduce_mean(tf.math.minimum(
        ratio * advantages,
        tf.clip_by_value(ratio, 1.0 - CLIP_VALUE, 1.0 + CLIP_VALUE) * advantages
    ))
    return policy_loss
```

The actor loss takes in the advantages (calculated in another function, explained below), the old probability values (again, calculated elsewhere), the action indices, and the policy logits from the neural network model. The action indices are the array indices of the actions actually taken during the trajectory roll-out. In other words, if, for a certain state in the game, the action probabilities are [0.01, 0.3, 0.5, 0.2] and the 0.5 action is taken, the action index, in this case, is 2 (with a zero-based index system). *action_inds* is a tensor of indices corresponding to the action taken for every record in the batch.

The first line of the function is where the policy logit values are converted to quasi-probabilities using the softmax function. These new policy logits are generated using the most recent neural network parameters – in other words, after the softmax has been applied, they refer to the $\pi_\theta (a_t | s_t)$ values in the $L^{CLIP}$ loss function. The *old_probs* values refer to the $\pi_{\theta_{old}}(a_t | s_t)$ values. We are only after the action probabilities that correspond to the actions actually taken during the agent’s playing of the game. The tf.gather_nd function, therefore, takes the *probs* and extracts those probability outputs which correspond to the actions that were actually taken, specified by the *action_inds*. This *action_inds* based selection has already been performed on the probabilities from the neural network parameterized by the old network weight values $\theta_{old}$, as will be shown later.
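The gather-style selection can be illustrated in plain Python with a made-up batch of two records:

```python
# Plain-Python equivalent of the tf.gather_nd selection described above:
# pick out, for each record in the batch, the probability of the action taken.
batch_probs = [
    [0.01, 0.3, 0.5, 0.2],   # record 0: action 2 was taken
    [0.6, 0.1, 0.2, 0.1],    # record 1: action 0 was taken
]
actions_taken = [2, 0]
action_inds = [(row, a) for row, a in enumerate(actions_taken)]  # [(0, 2), (1, 0)]
selected = [batch_probs[r][c] for r, c in action_inds]           # [0.5, 0.6]
```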

Then the *r* value (i.e. the $\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}$ value) is computed. After this, the clipping function is applied as per the $L^{CLIP}$ calculation, shown again below for reference:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$$

Notice the negative sign in the calculation of *policy_loss* – this ensures that $L^{CLIP}$ is maximized during the minimization optimization that will be performed in the TensorFlow environment.

Next, I introduce the *train_model* function, which calls the various loss functions and makes the gradient step:

```python
def train_model(action_inds, old_probs, states, advantages,
                discounted_rewards, optimizer, ent_discount_val):
    with tf.GradientTape() as tape:
        values, policy_logits = model.call(tf.stack(states))
        act_loss = actor_loss(advantages, old_probs, action_inds, policy_logits)
        ent_loss = entropy_loss(policy_logits, ent_discount_val)
        c_loss = critic_loss(discounted_rewards, values)
        tot_loss = act_loss + ent_loss + c_loss
    grads = tape.gradient(tot_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return tot_loss, c_loss, act_loss, ent_loss
```

This function takes advantage of the tf.GradientTape() context manager – for more details, see this post. Within this context, the model is called on the list of states collected during the unrolling of a trajectory of an agent through the game. The *states* variable will be a list of input states, of length equal to BATCH_SIZE. The model returns value and policy logit estimates. These values are then passed through to the various loss functions and summed. The gradients of the trainable variables with respect to this total loss value are then calculated, and an optimization step is undertaken using the *apply_gradients* method.

The next function calculates the discounted rewards and the advantages:

```python
def get_advantages(rewards, dones, values, next_value):
    discounted_rewards = np.array(rewards + [next_value[0]])
    for t in reversed(range(len(rewards))):
        discounted_rewards[t] = rewards[t] + GAMMA * discounted_rewards[t + 1] * (1 - dones[t])
    discounted_rewards = discounted_rewards[:-1]
    # advantages are bootstrapped discounted rewards - values, using Bellman's equation
    advantages = discounted_rewards - np.stack(values)[:, 0]
    # standardise advantages
    advantages -= np.mean(advantages)
    advantages /= (np.std(advantages) + 1e-10)
    # standardise rewards too
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= (np.std(discounted_rewards) + 1e-8)
    return discounted_rewards, advantages
```

This function has been discussed in previous tutorials, for instance, this post. In this case, both the advantages (used in the actor loss function) and the discounted rewards (used in the critic loss function) are normalized to improve training stability.

```python
model = Model(num_actions)
optimizer = keras.optimizers.Adam(learning_rate=LR)

train_writer = tf.summary.create_file_writer(
    STORE_PATH + f"/PPO-CartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")

num_steps = 10000000
episode_reward_sum = 0
state = env.reset()
episode = 1
total_loss = None
for step in range(num_steps):
    rewards = []
    actions = []
    values = []
    states = []
    dones = []
    probs = []
    for _ in range(BATCH_SIZE):
        _, policy_logits = model(state.reshape(1, -1))
        action, value = model.action_value(state.reshape(1, -1))
        new_state, reward, done, _ = env.step(action.numpy()[0])

        actions.append(action)
        values.append(value[0])
        states.append(state)
        dones.append(done)
        probs.append(policy_logits)
        episode_reward_sum += reward

        state = new_state
        if done:
            rewards.append(0.0)
            state = env.reset()
            if total_loss is not None:
                print(f"Episode: {episode}, latest episode reward: {episode_reward_sum}, "
                      f"total loss: {np.mean(total_loss)}, critic loss: {np.mean(c_loss)}, "
                      f"actor loss: {np.mean(act_loss)}, entropy loss {np.mean(ent_loss)}")
            with train_writer.as_default():
                tf.summary.scalar('rewards', episode_reward_sum, episode)
            episode_reward_sum = 0
            episode += 1
        else:
            rewards.append(reward)

    _, next_value = model.action_value(state.reshape(1, -1))
    discounted_rewards, advantages = get_advantages(rewards, dones, values, next_value[0])

    actions = tf.squeeze(tf.stack(actions))
    probs = tf.nn.softmax(tf.squeeze(tf.stack(probs)))
    action_inds = tf.stack([tf.range(0, actions.shape[0]), tf.cast(actions, tf.int32)], axis=1)

    total_loss = np.zeros(NUM_TRAIN_EPOCHS)
    act_loss = np.zeros(NUM_TRAIN_EPOCHS)
    c_loss = np.zeros(NUM_TRAIN_EPOCHS)
    ent_loss = np.zeros(NUM_TRAIN_EPOCHS)
    for epoch in range(NUM_TRAIN_EPOCHS):
        loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds), states,
                                 advantages, discounted_rewards, optimizer, ent_discount_val)
        total_loss[epoch] = loss_tuple[0]
        c_loss[epoch] = loss_tuple[1]
        act_loss[epoch] = loss_tuple[2]
        ent_loss[epoch] = loss_tuple[3]
    ent_discount_val *= ENT_DISCOUNT_RATE

    with train_writer.as_default():
        tf.summary.scalar('tot_loss', np.mean(total_loss), step)
        tf.summary.scalar('critic_loss', np.mean(c_loss), step)
        tf.summary.scalar('actor_loss', np.mean(act_loss), step)
        tf.summary.scalar('entropy_loss', np.mean(ent_loss), step)
```

The majority of this trajectory roll-out / training loop has been covered in this post. The main difference is the loop over NUM_TRAIN_EPOCHS:

```python
for epoch in range(NUM_TRAIN_EPOCHS):
    loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds), states,
                             advantages, discounted_rewards, optimizer, ent_discount_val)
    total_loss[epoch] = loss_tuple[0]
    c_loss[epoch] = loss_tuple[1]
    act_loss[epoch] = loss_tuple[2]
    ent_loss[epoch] = loss_tuple[3]
```

In this section of the code, the *train_model* function is run NUM_TRAIN_EPOCHS times. This can be performed in PPO using the importance sampling functionality i.e.:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

In this case, *probs* are the probabilities extracted from the model over the rolling out of the trajectory by the agent. In other words, they are the probabilities generated from the neural network with the parameters of the network based on the previous round of training. Therefore, these are the *old* probabilities, or $\pi_{\theta_{old}}(a_t | s_t)$ in the formula above. The *new* probabilities $\pi_{\theta}(a_t | s_t)$ are generated within the gradient tape context of *train_model* – a new set of parameters for each training loop. However, through each of these iterations, the old probabilities, or *probs*, stay constant.

Running this code on the CartPole environment yields the following rewards during training:

As can be observed, the agent manages to score the maximum CartPole reward of 200 quite early on in the training process; however, it takes some time to achieve this result consistently. Further fine-tuning of the entropy reduction factor, or other hyperparameters, would likely yield a more consistent high score sooner.

This concludes my introductory post on the Proximal Policy Optimization (PPO) method and its implementation in TensorFlow 2. I hope it was informative for you.

All code shown in this tutorial can be found at this site’s Github repository, in the ac2_tf2_cartpole.py file.

In the A2C algorithm, notice the title “Advantage Actor” – this refers first to the actor, the part of the neural network that is used to determine the actions of the agent. The “advantage” is a concept that expresses the *relative* benefit of taking a certain action at time *t* ($a_t$) from a certain state $s_t$. Note that it is not the “absolute” benefit, but the “relative” benefit. This will become clearer when I discuss the concept of “value”. The advantage is expressed as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

The Q value (discussed in other posts, for instance here, here and here) is the expected future rewards of taking action $a_t$ from state $s_t$. The value $V(s_t)$ is the expected value of the agent being in that state and operating under a certain action policy $\pi$. It can be expressed as:

$$V^{\pi}(s) = \mathbb{E} \left[\sum_{i=1}^T \gamma^{i-1}r_{i}\right]$$

Here $\mathbb{E}$ is the expectation operator, and the value $V^{\pi}(s)$ can be read as the expected value of future discounted rewards that will be gathered by the agent, operating under a certain action policy $\pi$. So, the Q value is the expected value of taking a certain action from the current state, whereas V is the expected value of *simply being* in the current state, under a certain action policy.

The *advantage* then is the *relative* benefit of taking a certain action from the current state. It’s kind of like a normalized Q value. For example, let’s consider the last state in a game, where after the next action the game ends. There are three possible actions from this state, with rewards of (51, 50, 49). Let’s also assume that the action selection policy $\pi$ is simply random, so there is an equal chance of any of the three actions being selected. The value of this state, then, is 50 ((51+50+49) / 3). If the first action is randomly selected (reward=51), the Q value is 51. However, the *advantage* is only equal to 1 (Q-V = 51-50). As can be observed and as stated above, the advantage is a kind of normalized or relative Q value.
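The worked example above can be checked in a few lines of Python:

```python
# Reproducing the worked example: a terminal state with three actions and a
# uniform-random policy, where each Q value equals the immediate reward.
q_values = [51.0, 50.0, 49.0]                 # Q(s, a_i) for the three actions
state_value = sum(q_values) / len(q_values)   # V(s) under the uniform policy = 50
advantages = [q - state_value for q in q_values]  # [1.0, 0.0, -1.0]
```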

Why is this important? If we are using Q values in some way to train our action-taking policy, in the example above the first action would send a “signal” or contribution of 51 to the gradient optimizer, which may be significant enough to push the parameters of the neural network significantly in a certain direction. However, given the other two actions possible from this state also have a high reward (50 and 49), the signal or contribution is really higher than it should be – it is not that much better to take action 1 instead of action 3. Therefore, Q values can be a source of high *variance* in the training process, and it is much better to use the normalized or baselined Q values, i.e. the advantage, in training. For more discussion of Q values and advantages, see my post on dueling Q networks.

In a previous post, I presented the policy gradient reinforcement learning algorithm. For details on this algorithm, please consult that post. However, the A2C algorithm shares important similarities with the PG algorithm, and therefore it is necessary to recap some of the theory. First, it has to be recalled that PG-based algorithms involve a neural network that directly outputs estimates of the probability distribution of the best next action to take in a given state. So, for instance, if we have an environment with 4 possible actions, the output from the neural network could be something like [0.5, 0.25, 0.1, 0.15], with the first action being currently favored. In the PG case, then, the neural network is the direct instantiation of the policy of the agent $\pi_{\theta}$ – where this policy is controlled by the parameters of the neural network $\theta$. This is opposed to Q based RL algorithms, where the neural network estimates the Q value in a given state for each possible action. In these algorithms, the action policy is generally an epsilon-greedy policy, where the best action is that action with the highest Q value (with some random choices involved to improve exploration).

The gradient of the loss function for the policy gradient algorithm is as follows:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} \log P_{\pi_{\theta}}(a_t|s_t)\right)\left(\sum_{t'= t + 1}^{T} \gamma^{t'-t-1} r_{t'} \right)$$

Note that the term:

$$G_t = \left(\sum_{t'= t + 1}^{T} \gamma^{t'-t-1} r_{t'} \right)$$

is just the discounted sum of the rewards from state $s_t$ onwards. In other words, it is an estimate of the true value function $V^{\pi}(s)$. Remember that in the PG algorithm, the network can only be trained after each full episode, and this is because of the term above. Therefore, note that the $G_t$ term above is an *estimate* of the true value function, as it is based on only a single trajectory of the agent through the game.
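Computed backwards over a trajectory, $G_t$ for the first state looks like the following sketch (the rewards and $\gamma$ are made-up):

```python
def discounted_return(rewards, gamma=0.99):
    # G_t = sum_{t'=t+1}^{T} gamma^{t'-t-1} * r_{t'}, accumulated backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# rewards collected after state s_t in a single (hypothetical) trajectory
g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
# g0 = 1 + 0.9 * (1 + 0.9 * 1) = 2.71
```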

Now, because it is based on samples of reward trajectories, which aren’t “normalized” or baselined in any way, the PG algorithm suffers from variance issues, resulting in slower and more erratic training progress. A better solution is to replace the $G_t$ function above with the Advantage – $A(s_t, a_t)$, and this is what the Advantage-Actor Critic method does.

Replacing the $G_t$ function with the advantage, we come up with the following gradient function which can be used in training the neural network:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} \log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$$

Now, as shown above, the advantage is:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

However, using Bellman’s equation, the Q value can be expressed purely in terms of the rewards and the value function:

$$Q(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma V(s_{t+1})\right]$$

Therefore, the advantage can now be estimated as:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

As can be seen from the above, there is a requirement to be able to estimate the value function V. We could estimate it by running our agents through full episodes, in the same way we did in the policy gradient method. However, it would be better to be able to just collect batches of game-steps and train whenever the batch buffer was full, rather than having to wait for an episode to finish. That way, the agent could actually learn “on-the-go” during the middle of an episode/game.

So, do we build another neural network to estimate V? We could have two networks, one to learn the policy and produce actions, and another to estimate the state values. A more efficient solution is to create one network, but with two output channels, and this is how the A2C method is outworked. The figure below shows the network architecture for an A2C neural network:

This architecture is based on an A2C method that takes game images as the state input, hence the convolutional neural network layers at the beginning of the network (for more on CNNs, see my post here). This network architecture also resembles the Dueling Q network architecture (see my Dueling Q post). The point to note about the architecture above is that most of the network is shared, with a late bifurcation between the policy part and the value part. The outputs $P(s, a_i)$ are the action probabilities of the policy (generated from the neural network) – $P(a_t|s_t)$. The other output channel is the value estimation – a scalar output which is the predicted value of state s – $V(s)$. The two dense channels disambiguate the policy and the value outputs from the front-end of the neural network.

In this example, we’ll just be demonstrating the A2C algorithm on the Cartpole OpenAI Gym environment which doesn’t require a visual state input (i.e. a set of pixels as the input to the NN), and therefore the two output channels will simply share some dense layers, rather than a series of CNN layers.

There are actually three loss values that need to be calculated in the A2C algorithm. Each of these losses is in practice given a weighting, and then they are summed together (with the entropy loss having a negative sign, see below).

The Critic, i.e. the value-estimating output of the neural network $V(s)$, needs to be trained so that it predicts the actual value of the state more and more closely. As shown before, the value of a state is calculated as:

$$V^{\pi}(s) = \mathbb{E} \left[\sum_{i=1}^T \gamma^{i-1}r_{i}\right]$$

So $V^{\pi}(s)$ is the expected value of the discounted future rewards obtained by outworking a trajectory through the game based on a certain operating policy $\pi$. We can therefore compare the predicted $V(s)$ at each state in the game, and the actual *sampled* discounted rewards that were gathered, and the difference between the two is the Critic loss. In this example, we’ll use a mean squared error function as the loss function, between the discounted rewards and the predicted values ($(V(s) – DR)^2$).
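As a concrete sketch, with entirely made-up value predictions and discounted reward targets, the critic loss calculation looks like this:

```python
import numpy as np

# hypothetical numbers for illustration only
predicted_values = np.array([3.1, 2.4, 1.6])      # V(s) output by the network
discounted_rewards = np.array([3.3, 2.4, 1.5])    # sampled discounted reward targets

critic_loss = np.mean((predicted_values - discounted_rewards) ** 2)
```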

Now, given that, under the A2C algorithm, we collect state, action and reward tuples until a batch buffer is filled, how are we meant to figure out this discounted rewards sum? Let’s say we progress 3 states through a game, and we collect:

$(V(s_0), r_0), (V(s_1), r_1), (V(s_2), r_2)$

For the first Critic loss, we could calculate it as:

$$MSE(V(s_0), r_0 + \gamma r_1 + \gamma^2 r_2)$$

But that is missing all the following rewards $r_3, r_4, …., r_n$ until the game terminates. We didn’t have this problem in the Policy Gradient method, because in that method, we made sure a full run through the game had completed before training the neural network. In the A2C method, we use a trick called *bootstrapping*. To replace all the discounted $r_3, r_4, …., r_n$ values, we get the network to estimate the value for state 3, $V(s_3)$, and this will be an estimate for all the discounted future rewards beyond that point in the game. So, for the first Critic loss, we would have:

$$MSE(V(s_0), r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V(s_3))$$

Where $V(s_3)$ is a *bootstrapped* estimate of the value of the next state $s_3$.
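To make the bootstrapped target concrete, here is the arithmetic for the three-step example above, with a hypothetical discount factor of 0.95 and a hypothetical network estimate $V(s_3) = 0.5$:

```python
GAMMA = 0.95                 # hypothetical discount factor
rewards = [1.0, 1.0, 1.0]    # r_0, r_1, r_2 collected over three steps
next_value = 0.5             # hypothetical network estimate V(s_3)

# target for V(s_0): r_0 + g*r_1 + g^2*r_2 + g^3*V(s_3)
target = sum(GAMMA ** i * r for i, r in enumerate(rewards)) + GAMMA ** 3 * next_value
```

The MSE critic loss for this sample would then be the squared difference between the network’s prediction $V(s_0)$ and this target.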

This will be explained more in the code-walkthrough to follow.

The second loss function needs to train the Actor (i.e. the action policy). Recall that the advantage weighted policy loss is:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$$

Let’s start with the advantage – $A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) – V(s_t)$

This is simply the bootstrapped discounted rewards minus the predicted state values $V(s_t)$ that we gathered up while playing the game. So calculating the advantage is quite straight-forward, once we have the bootstrapped discounted rewards, as will be seen in the code walk-through shortly.

Now, with regards to the $log P_{\pi_{\theta}}(a_t|s_t)$ term, in this instance, we can just calculate the log of the softmax probability estimate for whatever action was taken. So, for instance, if in state 1 ($s_1$) the network softmax output produces {0.1, 0.9} (for a 2-action environment), and the second action was actually taken by the agent, we would want to calculate log(0.9). We can make use of the TensorFlow-Keras SparseCategoricalCrossentropy loss, which takes the action as an integer, and this specifies which softmax output value to apply the log to. So in this example, y_true = [1] and y_pred = [0.1, 0.9], and the answer would be -log(0.9) ≈ 0.105.
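We can verify this arithmetic with plain NumPy (this mimics what the Keras loss computes for a single sample; it is not the Keras class itself):

```python
import numpy as np

probs = np.array([0.1, 0.9])    # softmax output for a 2-action environment
action = 1                      # index of the action actually taken

# cross-entropy contribution of the taken action: -log(p[action])
loss = -np.log(probs[action])
```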

Another handy feature of the SparseCategoricalCrossentropy loss in Keras is that it can be called with a “sample_weight” argument. This basically multiplies the log calculation by a value. So, in this example, we can supply the advantages as the sample weights, and it will calculate $\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$ for us. This will be shown below, but the call will look like:

policy_loss = sparse_ce(actions, policy_logits, sample_weight=advantages)

In many implementations of the A2C algorithm, another loss term is *subtracted* – the entropy loss. Entropy is a measure, broadly speaking, of randomness. The higher the entropy, the more random the state of affairs; the lower the entropy, the more ordered the state of affairs. In the case of A2C, entropy is calculated on the softmax policy action ($P(a_t|s_t)$) output of the neural network. Let’s go back to our two action example from above. In the case of a probability output of {0.1, 0.9} for the two possible actions, this is an ordered, less-random selection of actions. In other words, there will be a consistent selection of action 2, and only rarely will action 1 be taken. The entropy formula is:

$$E = -\sum p(x) log(p(x))$$

So in this case, the entropy of that output would be 0.325. However, if the probability output was instead {0.5, 0.5}, the entropy would be 0.693. The 50-50 action probability distribution will produce more random actions, and therefore the entropy is higher.
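These entropy values are easy to verify with a few lines of NumPy:

```python
import numpy as np

def entropy(p):
    # E = -sum p(x) log(p(x))
    return -np.sum(p * np.log(p))

ordered = entropy(np.array([0.1, 0.9]))   # consistent action selection, lower entropy
uniform = entropy(np.array([0.5, 0.5]))   # random action selection, higher entropy
```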

By subtracting the entropy calculation from the total loss (or giving the entropy loss a negative sign), it encourages *more* randomness and therefore *more exploration*. The A2C algorithm can have a tendency to converge on particular actions, so this subtraction of the entropy encourages a better exploration of alternative actions, though making the weighting on this component of the loss too large can also reduce training performance.

Again, we can use an already existing Keras loss function to calculate the entropy. The Keras categorical cross-entropy performs the following calculation:

$$CE = -\sum_i target_i \log(output_i)$$

If we just pass in the probability outputs as both *target* and *output* to this function, then it will calculate the entropy for us. This will be shown in the code below.

The total loss function for the A2C algorithm is:

Loss = Actor Loss + Critic Loss * CRITIC_WEIGHT – Entropy Loss * ENTROPY_WEIGHT

A common value for the critic weight is 0.5, and the entropy weight is usually quite low (i.e. on the order of 0.01-0.001), though these hyperparameters can be adjusted and experimented with depending on the environment and network.
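As a toy illustration with entirely hypothetical loss values, the combination looks like:

```python
CRITIC_WEIGHT = 0.5
ENTROPY_WEIGHT = 0.01

# hypothetical per-batch loss values, for illustration only
actor_loss, critic_loss, entropy_loss = 1.2, 0.8, 0.69

total_loss = actor_loss + critic_loss * CRITIC_WEIGHT - entropy_loss * ENTROPY_WEIGHT
```

Note the entropy term is subtracted: a higher-entropy (more exploratory) policy reduces the total loss.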

In the following section, I will provide a walk-through of some code to implement the A2C methodology in TensorFlow 2. The code for this can be found at this site’s Github repository, in the ac2_tf2_cartpole.py file.

First, we perform the usual imports, set some constants, initialize the environment and finally create the neural network model which instantiates the A2C architecture:

import tensorflow as tf
from tensorflow import keras
import numpy as np
import gym
import datetime as dt

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard/A2CCartPole'
CRITIC_LOSS_WEIGHT = 0.5
ACTOR_LOSS_WEIGHT = 1.0
ENTROPY_LOSS_WEIGHT = 0.05
BATCH_SIZE = 64
GAMMA = 0.95

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n


class Model(keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.num_actions = num_actions
        self.dense1 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(64, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.value = keras.layers.Dense(1)
        self.policy_logits = keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.value(x), self.policy_logits(x)

    def action_value(self, state):
        value, logits = self.predict_on_batch(state)
        action = tf.random.categorical(logits, 1)[0]
        return action, value

As can be seen, for this example I have set the critic, actor and entropy loss weights to 0.5, 1.0 and 0.05 respectively. Next the environment is setup, and then the model class is created.

This class inherits from keras.Model, which enables it to be integrated into the streamlined Keras methods of training and evaluating (for more information, see this Keras tutorial). In the initialization of the class, we see that 2 dense layers have been created, with 64 nodes in each. Then a value layer with one output is created, which evaluates $V(s)$, and finally the policy layer output with a size equal to the number of available actions. Note that this layer produces logits only; the softmax function which creates pseudo-probabilities ($P(a_t|s_t)$) will be applied within the various TensorFlow functions, as will be seen.

Next, the *call* function is defined – this function is run whenever a state needs to be “run” through the model, to produce a value and policy logits output. The Keras model API will use this function in its *predict* functions and also its training functions. In this function, it can be observed that the input is passed through the two common dense layers, and then the function returns first the value output, then the policy logits output.

The next function is the *action_value* function. This function is called upon when an action needs to be chosen from the model. As can be seen, the first step of the function is to run the *predict_on_batch* Keras model API function. This function just runs the *model.call* function defined above. The output is both the values and the policy logits. An action is then selected by randomly choosing an action based on the action probabilities. Note that *tf.random.categorical* takes as input logits, *not* softmax outputs. The next function, outside of the Model class, is the function that calculates the critic loss:

def critic_loss(discounted_rewards, predicted_values):
    return keras.losses.mean_squared_error(discounted_rewards, predicted_values) * CRITIC_LOSS_WEIGHT

As explained above, the critic loss is the mean squared error between the discounted rewards (which are calculated in another function, soon to be discussed) and the values predicted from the value output of the model (which are accumulated in a list during the agent’s trajectory through the game).

The following function shows the actor loss function:

def actor_loss(combined, policy_logits):
    actions = combined[:, 0]
    advantages = combined[:, 1]
    sparse_ce = keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.SUM
    )

    actions = tf.cast(actions, tf.int32)
    policy_loss = sparse_ce(actions, policy_logits, sample_weight=advantages)

    probs = tf.nn.softmax(policy_logits)
    entropy_loss = keras.losses.categorical_crossentropy(probs, probs)

    return policy_loss * ACTOR_LOSS_WEIGHT - entropy_loss * ENTROPY_LOSS_WEIGHT

The first argument to the *actor_loss* function is an array with two columns (and BATCH_SIZE rows). The first column corresponds to the recorded actions of the agent as it traversed the game. The second column is the calculated advantages – the calculation of which will be shown shortly. Next, the sparse categorical cross-entropy function class is created. The arguments specify that the input to the function is logits (i.e. they don’t have softmax applied to them yet), and it also specifies the reduction to apply to the BATCH_SIZE number of calculated losses – in this case, a sum() function which aligns with the summation in:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right)A(s_t, a_t)$$

Next, the actions are cast to be integers (rather than floats) and finally, the policy loss is calculated based on the sparse_ce function. As discussed above, the sparse categorical cross-entropy function will select those policy probabilities that correspond to the actions actually taken in the game, and weight them by the advantage values. By applying a summation reduction, the formula above will be implemented in this function.

Next, the actual probabilities for action are estimated by applying the softmax function to the logits, and the entropy loss is calculated by applying the categorical cross-entropy function. See the previous discussion on how this works.

The following function calculates the discounted reward values and the advantages:

def discounted_rewards_advantages(rewards, dones, values, next_value):
    discounted_rewards = np.array(rewards + [next_value[0]])
    for t in reversed(range(len(rewards))):
        discounted_rewards[t] = rewards[t] + GAMMA * discounted_rewards[t+1] * (1-dones[t])
    discounted_rewards = discounted_rewards[:-1]
    # advantages are bootstrapped discounted rewards - values, using Bellman's equation
    advantages = discounted_rewards - np.stack(values)[:, 0]
    return discounted_rewards, advantages

The first input value to this function is a list of all the rewards that were accumulated during the agent’s traversal of the game. The next is a list, the elements of which are either 1 or 0, depending on whether the game or episode ended at each time step. The *values* argument is a list of all the values, $V(s)$, generated by the model at each time step. Finally, the *next_value* argument is the bootstrapped estimate of the value for the state $s_{t+1}$ – in other words, it is the estimated value of all the discounted rewards “downstream” of the last state recorded in the lists. Further discussion on bootstrapping was given in a previous section.

On the first line of the function, a numpy array is created out of the list of rewards, with the bootstrapped *next_value* appended to it. A reversed loop is then entered into. To explain how this loop works, it is perhaps best to give a simple example. For every time-step in the Cartpole environment, if the stick hasn’t fallen past horizontal, a reward of 1 is awarded. So let’s consider a small batch of samples of only 3 time steps. Let’s also say that the bootstrapped *next_value* estimate is 0.5. Therefore, at this point, the discounted rewards array looks like the following: [1, 1, 1, 0.5].

This is what the discounted_rewards array looks like at each step in the loop:

t = 2 — discounted_rewards[2] = 1 + GAMMA * 0.5

t = 1 — discounted_rewards[1] = 1 + GAMMA(1 + GAMMA * 0.5) = 1 + GAMMA + GAMMA^2 * 0.5

t = 0 — discounted_rewards[0] = 1 + GAMMA(1 + GAMMA + GAMMA^2 * 0.5) = 1 + GAMMA + GAMMA^2 + GAMMA^3 * 0.5

As can be observed, this loop correctly generates the downstream discounted rewards values for each step in the batch. If the game finished in one of these time-steps, the accumulation of discounted future rewards is reset, so that rewards from a subsequent game won’t flow into the previous game that just ended.
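The reversed loop and the done-flag reset can be checked in isolation with NumPy, reusing the three-step example and the hypothetical bootstrapped value of 0.5:

```python
import numpy as np

GAMMA = 0.95
rewards = [1.0, 1.0, 1.0]
next_value = 0.5  # hypothetical bootstrapped V(s_3) estimate

def discount(rewards, dones, next_value):
    # same reversed-loop logic as in discounted_rewards_advantages above
    out = np.array(rewards + [next_value])
    for t in reversed(range(len(rewards))):
        out[t] = rewards[t] + GAMMA * out[t + 1] * (1 - dones[t])
    return out[:-1]

no_reset = discount(rewards, [0, 0, 0], next_value)    # matches the worked example
with_reset = discount(rewards, [0, 1, 0], next_value)  # episode ended at step 1
```

In the second case, the accumulated rewards from steps 2 onward do not flow back past the done flag, so the target at step 1 collapses to just its immediate reward.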

Because discounted_rewards[3] actually equals the bootstrapped *next_value*, it doesn’t apply to calculating the advantage, so the next line in the code simply restricts the scope of the discounted_rewards array so that this *next_value *is excluded.

Next, the advantages are calculated, simply by subtracting the estimated values from the discounted_rewards.

The following lines of code create a model instance, compile the model, and set up a TensorBoard writer for visualization purposes.

model = Model(num_actions)
model.compile(optimizer=keras.optimizers.Adam(), loss=[critic_loss, actor_loss])

train_writer = tf.summary.create_file_writer(STORE_PATH +
                                             f"/A2C-CartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")

Note that in the model compilation function, the loss function specified is a compound of the critic and actor loss (with the actor loss featuring the entropy impact, as shown above).

The code below shows the main training loop:

num_steps = 10000000

episode_reward_sum = 0
state = env.reset()
episode = 1
loss = None  # initialised so the first episode print doesn't fail before any training step has run
for step in range(num_steps):
    rewards = []
    actions = []
    values = []
    states = []
    dones = []
    for _ in range(BATCH_SIZE):
        _, policy_logits = model(state.reshape(1, -1))

        action, value = model.action_value(state.reshape(1, -1))
        new_state, reward, done, _ = env.step(action.numpy()[0])

        actions.append(action)
        values.append(value.numpy()[0])
        states.append(state)
        dones.append(done)
        episode_reward_sum += reward

        state = new_state
        if done:
            rewards.append(0.0)
            state = env.reset()
            print(f"Episode: {episode}, latest episode reward: {episode_reward_sum}, loss: {loss}")
            with train_writer.as_default():
                tf.summary.scalar('rewards', episode_reward_sum, episode)
            episode_reward_sum = 0
            episode += 1
        else:
            rewards.append(reward)

    _, next_value = model.action_value(state.reshape(1, -1))
    discounted_rewards, advantages = discounted_rewards_advantages(rewards, dones, values,
                                                                   next_value.numpy()[0])

    # combine the actions and advantages into a combined array for passing to
    # actor_loss function
    combined = np.zeros((len(actions), 2))
    combined[:, 0] = actions
    combined[:, 1] = advantages

    loss = model.train_on_batch(tf.stack(states), [discounted_rewards, combined])

    with train_writer.as_default():
        tf.summary.scalar('tot_loss', np.sum(loss), step)

At the beginning of each “step” or batch number, all of the lists (rewards, actions, values, states, dones) are emptied. A secondary loop is then entered into, which accumulates all of these lists. Within this inner loop, the action logits are generated from the model, and the actual action to be taken (*action* variable) and the state value (*value* variable) are retrieved from the *model.action_value* function. The action is then fed into the environment so that a step can be taken. This generates a new state, the reward for taking that action, and the done flag – signifying whether that action ended the game. All of these values are then appended to the various lists, and the episode reward accumulator is updated.

If the episode is done, the environment is reset and the total episode rewards are stored in the TensorBoard writer. If not, the reward is simply stored in the list.

After BATCH_SIZE samples have been stored in the lists, the inner loop is exited and it is time to train the model. The *next_value* bootstrapped value estimate is generated (recall the *state* variable has been updated to *new_state*, i.e. the next state in the game), and the discounted rewards and advantages are calculated. Next, a combined array is created and populated column-wise with the actions and advantages. These are then passed to the *model.train_on_batch* function. The *discounted_rewards* and *combined* variables are passed to this function, and will, in turn, be automatically fed into the critic and actor loss functions, respectively (along with the outputs from the *model.call* function – the value estimate and the policy logits).

The loss is returned and finally, this is logged also.

The outcome of training the Cartpole environment for 200 episodes can be seen in the graph below:

That’s the end of this tutorial on the powerful A2C reinforcement learning algorithm, and how to implement it in TensorFlow 2. In a future post, I will demonstrate how to apply this technique to a more challenging Atari game environment, making use of convolutional neural network layers and the actual game screen pixels.

I hope this was useful for you – all the best.

Google’s TensorFlow has been a hot topic in deep learning recently. The open source software, designed to allow efficient computation of data flow graphs, is especially suited to deep learning tasks. It is designed to be executed on single or multiple CPUs and GPUs, making it a good option for complex deep learning tasks. In its most recent incarnation – version 1.0 – it can even be run on certain mobile operating systems.

This introductory tutorial to TensorFlow will give an overview of some of the basic concepts of TensorFlow in Python. These will be a good stepping stone to building more complex deep learning networks, such as Convolutional Neural Networks, natural language models, and Recurrent Neural Networks in the package. We’ll be creating a simple three-layer neural network to classify the MNIST dataset. This tutorial assumes that you are familiar with the basics of neural networks, which you can get up to scratch with in the neural networks tutorial if required.

To install TensorFlow, follow the instructions here. The code for this tutorial can be found in this site’s GitHub repository. Once you’re done, you also might want to check out a higher level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

First, let’s have a look at the main ideas of TensorFlow.

TensorFlow is based on graph based computation – “what on earth is that?”, you might say. It’s an alternative way of conceptualising mathematical calculations. Consider the following expression $a = (b + c) * (c + 2)$. We can break this function down into the following components:

\begin{align}

d &= b + c \\

e &= c + 2 \\

a &= d * e

\end{align}

Now we can represent these operations graphically as:

This may seem like a silly example – but notice a powerful idea in expressing the equation this way: two of the computations ($d=b+c$ and $e=c+2$) can be performed in parallel. By splitting up these calculations across CPUs or GPUs, this can give us significant gains in computational times. These gains are a *must* for big data applications and deep learning – especially for complicated neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The idea behind TensorFlow is to provide the ability to create these computational graphs in code and allow significant performance improvements via parallel operations and other efficiency gains.

We can look at a similar graph in TensorFlow below, which shows the computational graph of a three-layer neural network.

The animated data flows between different nodes in the graph are *tensors* which are multi-dimensional data arrays. For instance, the input data tensor may be 5000 x 64 x 1, which represents a 64 node input layer with 5000 training samples. After the input layer, there is a hidden layer with rectified linear units as the activation function. There is a final output layer (called a “logit layer” in the above graph) that uses cross-entropy as a cost/loss function. At each point we see the relevant tensors flowing to the “Gradients” block which finally flows to the Stochastic Gradient Descent optimizer which performs the back-propagation and gradient descent.

Here we can see how computational graphs can be used to represent the calculations in neural networks, and this, of course, is what TensorFlow excels at. Let’s see how to perform some basic mathematical operations in TensorFlow to get a feel for how it all works.

So how can we make TensorFlow perform the little example calculation shown above – $a = (b + c) * (c + 2)$? First, there is a need to introduce TensorFlow variables. The code below shows how to declare these objects:

import tensorflow as tf

# create TensorFlow variables
const = tf.Variable(2.0, name="const")
b = tf.Variable(2.0, name='b')
c = tf.Variable(1.0, name='c')

As can be observed above, TensorFlow variables can be declared using the *tf.Variable* function. The first argument is the value to be assigned to the variable. The second is an optional name string which can be used to label the constant/variable – this is handy for when you want to do visualizations. TensorFlow will infer the type of the variable from the initialized value, but it can also be set explicitly using the optional *dtype* argument. TensorFlow has many of its own types like tf.float32, tf.int32 etc.

The objects assigned to the Python variables are actually TensorFlow tensors. Thereafter, they act like normal Python objects – therefore, if you want to access the tensors you need to keep track of the Python variables. In previous versions of TensorFlow, there were global methods of accessing the tensors and operations based on their names. This is no longer the case.

To examine the tensors stored in the Python variables, simply call them as you would a normal Python variable. If we do this for the “const” variable, you will see the following output:

<tf.Variable 'const:0' shape=() dtype=float32, numpy=2.0>

This output gives you a few different pieces of information – first, is the name ‘const:0’ which has been assigned to the tensor. Next is the data type, in this case, a TensorFlow float 32 type. Finally, there is a “numpy” value. TensorFlow variables in TensorFlow 2 can be converted easily into numpy objects. Numpy stands for Numerical Python and is a crucial library for Python data science and machine learning. If you don’t know Numpy, what it is, and how to use it, check out this site. The command to access the numpy form of the tensor is simply .numpy() – the use of this method will be shown shortly.

Next, some calculation operations are created:

# now create some operations
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')

Note that *d* and *e* are automatically converted to tensor values upon the execution of the operations. TensorFlow has a wealth of calculation operations available to perform all sorts of interactions between tensors, as you will discover as you progress through this book. The purposes of the operations shown above are pretty obvious: they instantiate the operations b + c, c + 2.0, and d * e. However, these operations are an unwieldy way of doing things in TensorFlow 2. The operations below are equivalent to those above:

d = b + c
e = c + 2
a = d * e

To access the value of variable *a*, one can use the *.numpy()* method as shown below:

print(f"Variable a is {a.numpy()}")

The computational graph for this simple example can be visualized by using the TensorBoard functionality that comes packaged with TensorFlow. This is a great visualization feature and is explained more in this post. Here is what the graph looks like in TensorBoard:

The larger two vertices or nodes, *b *and *c,* correspond to the variables. The smaller nodes correspond to the operations, and the edges between the vertices are the scalar values emerging from the variables and operations.

The example above is a trivial example – what would this look like if there was an array of *b* values from which an array of equivalent *a* values would be calculated? TensorFlow variables can easily be instantiated using numpy variables, like the following:

import numpy as np

b = tf.Variable(np.arange(0, 10), name='b')

Calling *b* shows the following:

<tf.Variable 'b:0' shape=(10,) dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

Note the numpy value of the tensor is an array. Because the numpy variable passed during the instantiation is a range of int32 values, we can’t add it directly to *c* as *c* is of float32 type. Therefore, the tf.cast operation, which changes the type of a tensor, first needs to be utilized like so:

d = tf.cast(b, tf.float32) + c

Running the rest of the previous operations, using the new *b* tensor, gives the following value for *a*:

Variable a is [ 3. 6. 9. 12. 15. 18. 21. 24. 27. 30.]

In numpy, the developer can directly access *slices* or individual indices of an array and change their values directly. Can the same be done in TensorFlow 2? Can individual indices and/or slices be accessed and changed? The answer is yes, but not quite as straight-forwardly as in numpy. For instance, if *b* was a simple numpy array, one could easily execute b[1] = 10 – this would change the value of the second element in the array to the integer 10. In TensorFlow 2, the equivalent change is made by calling the *assign* method on the variable:

b[1].assign(10)

This will then flow through to *a* like so:

Variable a is [ 3. 33. 9. 12. 15. 18. 21. 24. 27. 30.]

The developer could also run the following, to assign a slice of *b* values:

b[6:9].assign([10, 10, 10])

A new tensor can also be created by using the slice notation:

f = b[2:5]

The explanations and code above show you how to perform some basic tensor manipulations and operations. In the section below, an example will be presented where a neural network is created using the Eager paradigm in TensorFlow 2. It will show how to create a training loop, perform a feed-forward pass through a neural network and calculate and apply gradients to an optimization method.

In this section, a simple three-layer neural network built in TensorFlow is demonstrated. In following chapters more complicated neural network structures such as convolution neural networks and recurrent neural networks are covered. For this example, though, it will be kept simple.

In this example, the MNIST dataset will be used that is packaged as part of the TensorFlow installation. This MNIST dataset is a set of 28×28 pixel grayscale images which represent hand-written digits. It has 60,000 training rows, 10,000 testing rows, and 5,000 validation rows. It is a very common, basic, image classification dataset that is used in machine learning.

The data can be loaded by running the following:

from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

As can be observed, the Keras MNIST data loader returns Python tuples corresponding to the training and test set respectively (Keras is another deep learning framework, now tightly integrated with TensorFlow, as mentioned earlier). The data sizes of the tuples defined above are:

*x_train:* (60,000 x 28 x 28)
*y_train:* (60,000)
*x_test:* (10,000 x 28 x 28)
*y_test:* (10,000)

The *x* data is the image information – 60,000 images of 28 x 28 pixels size in the training set. The images are grayscale (i.e. black and white) with maximum values, specifying the intensity of whites, of 255. The *x* data will need to be scaled so that it resides between 0 and 1, as this improves training efficiency. The *y* data is the matching image labels – signifying what digit is displayed in the image. This will need to be transformed to “one-hot” format.

When using a standard, categorical cross-entropy loss function (this will be shown later), a one-hot format is required when training classification tasks, as the output layer of the neural network will have the same number of nodes as the total number of possible classification labels. The output node with the highest value is considered as a prediction for that corresponding label. For instance, in the MNIST task, there are 10 possible classification labels – 0 to 9. Therefore, there will be 10 output nodes in any neural network performing this classification task. If we have an example output vector of [0.01, 0.8, 0.25, 0.05, 0.10, 0.27, 0.55, 0.32, 0.11, 0.09], the maximum value is in the second position / output node, and therefore this corresponds to the digit “1”. To train the network to produce this sort of outcome when the digit “1” appears, the loss needs to be calculated according to the difference between the output of the network and a “one-hot” array of the label 1. This one-hot array looks like [0, 1, 0, 0, 0, 0, 0, 0, 0, 0].
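As a quick sketch, the one-hot encoding of a batch of labels can be constructed directly in NumPy (the labels here are made up for illustration):

```python
import numpy as np

labels = np.array([1, 0, 3])          # example digit labels
num_classes = 10

# one row per label, a 1.0 in the column matching that label
one_hot = np.zeros((len(labels), num_classes))
one_hot[np.arange(len(labels)), labels] = 1.0
```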

This conversion is easily performed in TensorFlow, as will be demonstrated shortly when the main training loop is covered.

One final thing that needs to be considered is how to extract the training data in batches of samples. The function below can handle this:

def get_batch(x_data, y_data, batch_size):
    idxs = np.random.randint(0, len(y_data), batch_size)
    return x_data[idxs,:,:], y_data[idxs]

As can be observed in the code above, the data to be batched, i.e. the *x* and *y* data, is passed to this function along with the batch size. The first line of the function generates a random vector of integers, with random values between 0 and the length of the data passed to the function. The number of random integers generated is equal to the batch size. The *x* and *y* data are then returned, but only for those randomly chosen indices. Note that this is performed on numpy array objects – as will be shown shortly, the conversion from numpy arrays to tensor objects will be performed “on the fly” within the training loop.
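The function is easy to exercise on dummy data (the definition is repeated here so the snippet is self-contained; the shapes are made up for illustration):

```python
import numpy as np

def get_batch(x_data, y_data, batch_size):
    idxs = np.random.randint(0, len(y_data), batch_size)
    return x_data[idxs, :, :], y_data[idxs]

# dummy data shaped like tiny "images": 5 samples of 4 x 3 pixels
x = np.arange(60).reshape(5, 4, 3)
y = np.arange(5)

x_batch, y_batch = get_batch(x, y, 2)  # batch of 2 randomly chosen samples
```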

There is also the requirement for a loss function and a feed-forward function, but these will be covered shortly.

```python
# Python optimisation variables
epochs = 10
batch_size = 100

# normalize the input images by dividing by 255.0
x_train = x_train / 255.0
x_test = x_test / 255.0

# convert x_test to tensor to pass through model (train data will be converted to
# tensors on the fly)
x_test = tf.Variable(x_test)
```

First, the number of training epochs and the batch size are created – note these are simple Python variables, not TensorFlow variables. Next, the input training and test data, *x_train* and *x_test*, are scaled so that their values are between 0 and 1. Input data should always be scaled when training neural networks, as large, uncontrolled, inputs can heavily impact the training process. Finally, the test input data, *x_test* is converted into a tensor. The random batching process for the training data is most easily performed using numpy objects and functions. However, the test data will not be batched in this example, so the full test input data set *x_test* is converted into a tensor.

The next step is to set up the weight and bias variables for the three-layer neural network. There are always *L – 1* weight/bias tensor pairs, where *L* is the number of layers. These variables are defined in the code below:

```python
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random.normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random.normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random.normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random.normal([10]), name='b2')
```

The weight and bias variables are initialized using the *tf.random.normal* function – this function creates tensors of random numbers, drawn from a normal distribution. It allows the developer to specify things like the standard deviation of the distribution from which the random numbers are drawn.

Note the shape of the variables. The W1 variable is a [784, 300] tensor – the 784 nodes are the size of the input layer. This size comes from the flattening of the input images – if we have 28 rows and 28 columns of pixels, flattening these out gives us 1 row or column of 28 x 28 = 784 values. The 300 in the declaration of W1 is the number of nodes in the hidden layer. The W2 variable is a [300, 10] tensor, connecting the 300-node hidden layer to the 10-node output layer. In each case, a name is given to the variable for later viewing in TensorBoard – the TensorFlow visualization package. The next step in the code is to create the computations that occur within the nodes of the network. If the reader recalls, the computations within the nodes of a neural network are of the following form:
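The flattening that motivates the 784-row dimension of *W1* can be checked with a quick numpy sketch:

```python
import numpy as np

img = np.arange(28 * 28).reshape(28, 28)  # a dummy 28 x 28 "image"
flat = img.reshape(-1)                    # flatten to a single vector
# flat has 28 * 28 = 784 elements, matching the input dimension of W1
```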

$$z = Wx + b$$

$$h=f(z)$$

Where *W* is the weights matrix, *x* is the layer input vector, *b* is the bias and *f* is the activation function of the node. These calculations comprise the feed-forward pass of the input data through the neural network. To execute these calculations, a dedicated feed-forward function is created:

```python
def nn_model(x_input, W1, b1, W2, b2):
    # flatten the input image from 28 x 28 to 784
    x_input = tf.reshape(x_input, (x_input.shape[0], -1))
    x = tf.add(tf.matmul(tf.cast(x_input, tf.float32), W1), b1)
    x = tf.nn.relu(x)
    logits = tf.add(tf.matmul(x, W2), b2)
    return logits
```

Examining the first line, the *x_input* data is reshaped from (batch_size, 28, 28) to (batch_size, 784) – in other words, the images are flattened out. On the next line, the input data is converted to the *tf.float32* type using the TensorFlow cast function. This is important – the *x_input* data comes in as *tf.float64*, and TensorFlow won’t perform a matrix multiplication operation (*tf.matmul*) between tensors of different data types. This re-typed input data is then matrix-multiplied by *W1* using the TensorFlow *matmul* function (which stands for matrix multiplication). Then the bias *b1* is added to this product. On the line after this, the ReLU activation function is applied to the output of this calculation. The ReLU function is usually the best activation function to use in deep learning – the reasons for this are discussed in this post.

The output of this calculation is then multiplied by the final set of weights *W2*, with the bias *b2* added. The output of this calculation is titled *logits*. Note that no activation function has been applied to this output layer of nodes (yet). In machine/deep learning, the term “logits” refers to the un-activated output of a layer of nodes.

The reason no activation function has been applied to this layer is that there is a handy function in TensorFlow called *tf.nn.softmax_cross_entropy_with_logits*. This function does two things for the developer – first, it applies a softmax activation function to the logits, which transforms them into quasi-probabilities (i.e. the sum over the output nodes is equal to 1). This is a common activation function to apply to an output layer in classification tasks. Next, it applies the cross-entropy loss function to the softmax activation output. The cross-entropy loss function is commonly used in classification tasks. The theory behind it is quite interesting, but it won’t be covered in this post – a good summary can be found here. The code below applies this handy TensorFlow function, and in this example, it has been nested in another function called *loss_fn*:

```python
def loss_fn(logits, labels):
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return cross_entropy
```

The arguments to *softmax_cross_entropy_with_logits* are *labels* and *logits*. The *logits* argument is supplied from the outcome of the *nn_model* function. The usage of this function in the main training loop will be demonstrated shortly. The *labels* argument is supplied from the one-hot *y* values that are fed into *loss_fn* during the training process. The output of the *softmax_cross_entropy_with_logits* function is the cross-entropy loss value for each sample in the batch. To train the weights of the neural network, the average cross-entropy loss across the samples needs to be minimized as part of the optimization process. This average is calculated using the *tf.reduce_mean* function, which, unsurprisingly, calculates the mean of the tensor supplied to it.
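To make the combined operation concrete, here is a minimal numpy sketch of what *softmax_cross_entropy_with_logits* followed by *reduce_mean* computes (a hand-rolled illustration, not the TensorFlow implementation, which handles numerical edge cases more carefully):

```python
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_cross_entropy(logits, one_hot_labels):
    # per-sample cross-entropy: -sum(label * log(softmax(logit)))
    p = softmax(logits)
    return -np.sum(one_hot_labels * np.log(p), axis=-1)

logits = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])
per_sample = softmax_cross_entropy(logits, labels)
mean_loss = per_sample.mean()   # the reduce_mean step
```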

The next step is to define an optimizer. In many examples on this site, the versatile *Adam* optimizer is used. The theory behind this optimizer is interesting, and is worth further examination (such as shown here), but won’t be covered in detail within this post. It is basically a gradient descent method, but with sophisticated averaging of the gradients to provide appropriate momentum to the learning. To define the optimizer, which will be used in the main training loop, the following code is run:

```python
# setup the optimizer
optimizer = tf.keras.optimizers.Adam()
```

The *Adam* object can take a learning rate as input, but for the present purposes, the default value is used.

Now that the appropriate functions, variables and optimizers have been created, it is time to define the overall training loop. The training loop is shown below:

```python
total_batch = int(len(y_train) / batch_size)
for epoch in range(epochs):
    avg_loss = 0
    for i in range(total_batch):
        batch_x, batch_y = get_batch(x_train, y_train, batch_size=batch_size)
        # create tensors
        batch_x = tf.Variable(batch_x)
        batch_y = tf.Variable(batch_y)
        # create a one hot vector
        batch_y = tf.one_hot(batch_y, 10)
        with tf.GradientTape() as tape:
            logits = nn_model(batch_x, W1, b1, W2, b2)
            loss = loss_fn(logits, batch_y)
        gradients = tape.gradient(loss, [W1, b1, W2, b2])
        optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2]))
        avg_loss += loss / total_batch
    test_logits = nn_model(x_test, W1, b1, W2, b2)
    max_idxs = tf.argmax(test_logits, axis=1)
    test_acc = np.sum(max_idxs.numpy() == y_test) / len(y_test)
    print(f"Epoch: {epoch + 1}, loss={avg_loss:.3f}, test set accuracy={test_acc*100:.3f}%")

print("\nTraining complete!")
```

Stepping through the lines above, the first line is a calculation to determine the number of batches to run through in each training epoch – this ensures that, on average, each training sample will be used once per epoch. After that, a loop over the training epochs is entered. An *avg_loss* variable is initialized to keep track of the average cross-entropy loss for each epoch. The next line is where randomised batches of samples (*batch_x* and *batch_y*) are extracted from the MNIST training dataset, using the *get_batch()* function that was created earlier.

Next, the *batch_x* and *batch_y* numpy variables are converted to tensor variables. After this, the label data stored in *batch_y* as simple integers (i.e. 2 for the handwritten digit “2” and so on) needs to be converted to “one hot” format, as discussed previously. To do this, the *tf.one_hot* function can be utilized – the first argument to this function is the tensor you wish to convert, and the second argument is the number of distinct classes. This transforms the *batch_y* tensor from size (batch_size,) to (batch_size, 10).

The next line is important. Here the TensorFlow *GradientTape* API is introduced. In previous versions of TensorFlow, a static graph of all the operations and variables was constructed. In that paradigm, the gradients that needed to be calculated could be determined by reading the graph structure. However, in Eager mode, all tensor calculations are performed on the fly, and TensorFlow doesn’t know which variables and operations you are interested in calculating gradients for. The GradientTape API is the solution to this. Whatever variables and operations you wish to calculate gradients over, you supply within the “*with GradientTape() as tape:*” context manager. In a neural network, this covers all the variables and operations involved in the feed-forward pass through the network, along with the evaluation of the loss function. Note that if you call a function within the gradient tape context, all the operations performed within that function (and any further nested functions) will be captured for gradient calculation as required.

As can be observed in the code above, the feed forward pass and the loss function evaluation are encapsulated in the functions which were explained earlier: *nn_model *and *loss_fn*. By executing these functions within the gradient tape context manager, TensorFlow knows to keep track of all the variables and operation outcomes to ensure they are ready for gradient computations. Following the function calls *nn_model *and *loss_fn* within the gradient tape context, we have the place where the gradients of the neural network are calculated.

Here, the gradient tape is accessed via its name (*tape* in this example) and its *gradient()* method is called. The first argument to this function is the dependent variable of the differentiation, and the second argument is the independent variable(s). In other words, if we were trying to calculate the derivative *dy/dx*, the first argument would be *y* and the second would be *x*. In the context of a neural network, we are trying to calculate *dL/dw* and *dL/db*, where *L* is the loss, *w* represents the weights and *b* the biases. Therefore, in the code above, the reader can observe that the first argument is the *loss* output from *loss_fn* and the second argument is a list of all the weight and bias variables throughout the simple neural network.

The next line is where these gradients are zipped together with the weight and bias variables and passed to the optimizer to perform the gradient descent step. This is executed easily using the optimizer’s *apply_gradients()* function.
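For plain stochastic gradient descent, the update applied to each variable is simply $w \leftarrow w - \eta \, dL/dw$. Here is a minimal numpy sketch of what *apply_gradients* does in that simple case (Adam additionally maintains running averages of the gradients, which is not shown):

```python
import numpy as np

learning_rate = 0.01
weights = [np.ones(3), np.ones(3)]            # stand-ins for [W, b]
grads = [np.full(3, 0.5), np.full(3, -0.5)]   # stand-ins for [dL/dW, dL/db]

# zip pairs each gradient with its variable, just as in the training loop
for w, g in zip(weights, grads):
    w -= learning_rate * g   # descend the gradient: w <- w - lr * dL/dw
```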

The line following this is the accumulation of the average loss within the epoch. This constitutes the inner-epoch training loop. In the outer epoch training loop, after each epoch of training, the accuracy of the model on the test set is evaluated.

To determine the accuracy, first the test set images are passed through the neural network model using *nn_model*. This returns the *logits* from the model (the un-activated outputs from the last layer). The “prediction” of the model is then calculated from these logits – the output node with the highest logit value constitutes the digit prediction of the model. To find the highest logit value for each test image, we can use the *tf.argmax()* function. This function mimics the numpy *argmax()* function, which returns the index of the highest value in an array/tensor. The logits output from the model in this case will have dimensions (test_set_size, 10) – we want the argmax function to find the maximum across the “column” dimension, i.e. across the 10 output nodes. The “row” dimension corresponds to axis=0, and the column dimension corresponds to axis=1. Therefore, supplying the axis=1 argument to the *tf.argmax()* function produces a (test_set_size,) vector of integer predictions.

In the following line, these *max_idxs* are converted to a numpy array (using *.numpy()*) and compared for equality with the test labels (also integers – you will recall that we did not convert the test labels to one-hot format). Where the labels are equal, the comparison returns “true”, which is equivalent to the integer 1 in numpy, and “false” / 0 otherwise. Summing the results of these comparisons gives the number of correct predictions. Dividing this by the total size of the test set yields the test set accuracy.
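The accuracy calculation can be illustrated with a small numpy sketch using made-up logits and labels:

```python
import numpy as np

test_logits = np.array([[0.1, 2.0, 0.3],    # predicted class: 1
                        [1.5, 0.2, 0.1]])   # predicted class: 0
y_test = np.array([1, 1])                   # true labels

max_idxs = np.argmax(test_logits, axis=1)   # index of the max logit per row
test_acc = np.sum(max_idxs == y_test) / len(y_test)   # 1 correct of 2
```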

__Note__: if some of these explanations aren’t immediately clear, it is a good idea to jump over to the code supplied for this post and run it within a standard Python development environment. Insert a breakpoint at the code you want to examine more closely – you can then inspect all the tensor sizes, convert them to numpy arrays, apply operations on the fly and so on. This is all possible in TensorFlow 2 now that the default operating paradigm is Eager execution.

The epoch number, average loss and accuracy are then printed, so one can observe the progress of the training. The average loss should be decreasing on average after every epoch – if it is not, something is going wrong with the network, or the learning has stagnated. Therefore, it is an important variable to monitor. On running this code, something like the following output should be observed:

Epoch: 1, loss=0.317, test set accuracy=94.350%

Epoch: 2, loss=0.124, test set accuracy=95.940%

Epoch: 3, loss=0.085, test set accuracy=97.070%

Epoch: 4, loss=0.065, test set accuracy=97.570%

Epoch: 5, loss=0.052, test set accuracy=97.630%

Epoch: 6, loss=0.048, test set accuracy=97.620%

Epoch: 7, loss=0.037, test set accuracy=97.770%

Epoch: 8, loss=0.032, test set accuracy=97.630%

Epoch: 9, loss=0.027, test set accuracy=97.950%

Epoch: 10, loss=0.022, test set accuracy=98.000%

Training complete!

As can be observed, the loss declines monotonically, and the test set accuracy trends steadily upward (with small fluctuations from epoch to epoch). This shows that the model is training correctly. It is also possible to visualize the training progress using TensorBoard, as shown below:

I hope this tutorial was instructive and helps get you going on the TensorFlow journey. Just a reminder, you can check out the code for this post here. I’ve also written an article that shows you how to build more complex neural networks such as convolutional neural networks, recurrent neural networks, and Word2Vec natural language models in TensorFlow. You also might want to check out a higher-level deep learning library that sits on top of TensorFlow called Keras – see my Keras tutorial.

Have fun!

The code contained in this tutorial can be found on this site’s Github repository.

Bayes theorem is one of the most important statistical concepts a machine learning practitioner or data scientist needs to know. In the machine learning context, it can be used to estimate the model parameters (e.g. the weights in a neural network) in a statistically robust way. It can also be used in model selection, e.g. choosing which machine learning model best addresses a given problem. I won’t be going in-depth into all the possible uses of Bayes theorem here, but I will be introducing the main components of the theorem.

Bayes theorem can be shown in a fairly simple equation involving *conditional probabilities* as follows:

$$P(\theta \vert D) = \frac{P(D \vert \theta) P(\theta)}{P(D)}$$

In this representation, the variable $\theta$ corresponds to the model parameters (i.e. the values of the weights in a neural network), and the variable $D$ corresponds to the data that we are using to estimate the $\theta$ values. Before I talk about what *conditional probabilities *are, I’ll just quickly point out three terms in this formula which are *very important* to familiarise yourself with, as they come up in the literature all the time. It is worthwhile memorizing what these terms refer to:

$P(\theta \vert D)$ – this is called the *posterior*

$P(D \vert \theta)$ – this is called the *likelihood*

$P(\theta)$ – this is called the *prior*

I’m going to explain what all these terms refer to shortly, but first I’ll make a quick detour to discuss *conditional probability* for those who may not be familiar. If you are already familiar with conditional probability, feel free to skip this section.

Conditional probability is an important statistical concept that is thankfully easy to understand, as it forms a part of our everyday reasoning. Let’s say we have a random variable called *RT* which represents whether it will rain today – it is a discrete variable and can take on the value of either 1 or 0, denoting whether it will rain today or not. Let’s say we are in a fairly dry environment, and by consulting some long-term rainfall records we know that *RT=1* about 10% of the time, and therefore *RT=0* about 90% of the time. This fully represents the probability function for *RT*, which can be written as *P(RT)*. Therefore, we have some *prior* knowledge of what *P(RT)* is in the absence of any other determining factors.

Ok, so what does *P(RT)* look like if we know it rained yesterday? Is it the same or is it different? Well, let’s say the region we are in gets most of its rainfall due to big weather systems that can last for days or weeks – in this case, we have good reason to believe that *P(RT)* will be different *given the fact that it rained yesterday*. Therefore, the probability *P(RT)* is now *conditioned* on our understanding of another random variable *RY*, which represents whether it rained *yesterday*. The way of showing this *conditional probability* is by using the vertical slash symbol $\vert$ – so the conditional probability that it will rain today given it rained yesterday looks like the following: $P(RT=1 \vert RY = 1)$. Given this conditioning, the probability that it will rain today may no longer be 10%, but might rise to, say, 30%, so $P(RT=1 \vert RY = 1) = 0.3$.

We could also look at other probabilities, such as $P(RT=1 \vert RY = 0)$ or $P(RT=0 \vert RY = 1)$ and so on. To generalize this relationship we would just write $P(RT \vert RY)$.
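These conditional probabilities can be computed from a joint distribution via $P(RT \vert RY) = P(RT, RY) / P(RY)$. Here is a small sketch with made-up joint probabilities chosen to be consistent with the rainfall numbers above:

```python
# hypothetical joint distribution over (RY, RT), chosen so that
# P(RT=1) = 0.1 and P(RT=1 | RY=1) = 0.3 as in the text
p_joint = {(1, 1): 0.03, (1, 0): 0.07, (0, 1): 0.07, (0, 0): 0.83}

p_ry1 = p_joint[(1, 1)] + p_joint[(1, 0)]      # P(RY=1) = 0.10
p_rt1_given_ry1 = p_joint[(1, 1)] / p_ry1      # P(RT=1 | RY=1) = 0.3
```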

Now that you have an understanding of conditional probabilities, let’s move on to explaining Bayes Theorem (which contains two conditional probability functions) in more detail.

Ok, so as I stated above, it is time to delve into the meaning of the individual terms of Bayes theorem. Let’s first look at the *posterior* term – $P(\theta \vert D)$. This term can be read as: given we have a certain dataset $D$, what is the probability of our parameters $\theta$? This is the term we want to maximize when varying the parameters of a model according to a dataset – by doing so, we find those parameters $\theta$ *which are most probable *given the model we are using and the training data supplied. The *posterior* is on the left-hand side of the equation of Bayes Theorem, so if we want to maximize the *posterior* we can do this by maximizing the right-hand side of the equation.

Let’s have a look at the terms on the right-hand side.

The likelihood is expressed as $P(D \vert \theta)$ and can be read as: given this parameter $\theta$, which defines some process of generating data, what is the probability we would see this given set of data $D$? Let’s say we have a scattering of data-points – a good example might be the heights of all the members of a classroom full of kids. We can define a model that we assume is able to generate or represent this data – in this case, the Normal distribution is a good choice. The parameters that we are trying to determine for the Normal distribution are the tuple ($\mu$, $\sigma$) – the mean and standard deviation of the Normal distribution.

So the likelihood $P(D \vert \theta)$ in this example is the probability of seeing this sample of measured heights given different values of the mean and variance of the Normal distribution function. There is some more mathematical precision needed here (such as the difference between a probability distribution and a probability density function, discrete samples etc.) but this is ok for our purposes of coming to a conceptual understanding.

I’ll come back to the concept of the likelihood shortly when we discuss maximum likelihood estimation, but for now, let’s move onto the *prior*.

The *prior* probability $P(\theta)$, as can be observed, is not a conditioned probability distribution. It is simply a representation of the probability of the parameters *prior* to any consideration of data or *evidence*. You may be puzzled as to what the point of this probability is. In the context of machine learning or probabilistic programming, its purpose is to enable us to specify some *prior* understanding of what the parameters should actually be, and the prior probability distribution they should be drawn from.

Returning to the example of the heights of kids in a classroom: let’s say the teacher is a pretty good judge of heights, and can therefore come to the problem with a rough *prior* estimate of the mean height – say, a guess of around 130cm. The teacher can then put a *prior* around the mean parameter $\mu$ of, say, a normal distribution with a mean of 130cm.

The presence of the prior in the Bayes theorem allows us to introduce expert knowledge or prior beliefs into the problem, which aids the finding of the optimal parameters $\theta$. These prior beliefs are then *updated* by the data collected $D$ – with the updating occurring through the action of the likelihood function.

The graph below is an example of the evolution of a *prior* distribution function exposed to some set of data:

Here we can see that, through the application of Bayes Theorem, we can start out with a certain set of *prior* beliefs in the form of a *prior* distribution function, but by applying the evidence or data through the likelihood $P(D \vert \theta)$, the posterior estimate $P(\theta \vert D)$ moves closer to “reality”.

The final term in Bayes Theorem is the unconditioned probability distribution of the process that generated the data $P(D)$. In machine learning applications, this distribution is often unknown – but thankfully, it doesn’t matter. This distribution acts as a normalization constant and has nothing to say about the parameters we are trying to estimate $\theta$. Therefore, because we are trying to simply maximize the right-hand side of the equation, it drops out of any derivative calculation that is made in order to find the maximum. So in the context of machine learning and estimating parameters, this term can be safely ignored. Given this understanding, the form of Bayes Theorem that we are mostly interested in for machine learning purposes is as follows: $$P(\theta \vert D) \propto P(D \vert \theta) P(\theta)$$

Given this formulation, all we are concerned about is either maximizing the right-hand side of the equation or by simulating the sampling of the posterior itself (not covered in this post).

Now that we have reviewed conditional probability concepts and Bayes Theorem, it is time to consider how to apply Bayes Theorem in practice to estimate the best parameters in a machine learning problem. There are a number of ways of estimating the posterior of the parameters in a machine learning problem. These include maximum likelihood estimation, maximum a posteriori (MAP) estimation, simulating the sampling from the posterior using Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling, and so on. In this post, I will just be considering maximum likelihood estimation (MLE), with other methods being considered in future content on this site.

What happens if we just throw our hands up in the air with regards to the prior $P(\theta)$ and say we don’t know anything about the best parameters to describe the data? The prior then becomes a uniform or un-informative prior: $P(\theta)$ becomes a constant (the same probability no matter what the parameter values are), and Bayes Theorem reduces to:

$$P(\theta \vert D) \propto P(D \vert \theta)$$

If this is the case, all we have to do is maximize the likelihood $P(D \vert \theta)$ and by doing so we will also find the maximum of the posterior – i.e. the parameter with the highest probability given our model and data – or, in short, an estimate of the optimal parameters. If we have a way of calculating $P(D \vert \theta)$ while varying the parameters $\theta$, we can then feed this into some sort of optimizer to calculate:

$$\underset{\theta}{\operatorname{argmax}} P(D \vert \theta)$$

Nearly always, instead of maximizing $P(D \vert \theta)$ the log of $P(D \vert \theta)$ is maximized. Why? If we were doing the calculations by hand, we would need to calculate the derivative of the product of multiple exponential functions (as probability functions like the Normal distribution have exponentials in them) which is tricky. Because logs are monotonically increasing functions, they have maximums at the same point as the non-log function. So in other words, the maximum likelihood will occur at the same parameter value as the maximum of the log likelihood. By taking the log of the likelihood, products turn into sums and this makes derivative calculations a whole lot easier.

Finally, some optimizers in machine learning packages such as TensorFlow only *minimize* loss functions, so we need to invert the sign of the loss function in order to *maximize* it. In that case, for maximum likelihood estimation, we would *minimize* the negative log likelihood, or NLL, and get the same result.
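A tiny numerical sketch of this idea: for data assumed to come from a unit-variance Normal with unknown mean, the NLL (up to an additive constant) is $\frac{1}{2}\sum_i (x_i - \mu)^2$, and minimizing it over a grid of candidate means recovers the sample mean:

```python
import numpy as np

data = np.array([1.2, 0.8, 1.0])
candidate_mus = np.linspace(0.0, 2.0, 201)   # grid of candidate means

def nll(mu):
    # negative log likelihood of a unit-variance Normal, dropping constants
    return 0.5 * np.sum((data - mu) ** 2)

best_mu = candidate_mus[np.argmin([nll(mu) for mu in candidate_mus])]
# best_mu coincides with the sample mean of the data
```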

Let’s look at a simple example of maximum likelihood estimation by using TensorFlow Probability.

For the simple example of maximum likelihood estimation that is to follow, TensorFlow Probability is overkill – however, TensorFlow Probability is a great extension of TensorFlow into the statistical domain, so it is worthwhile introducing MLE by utilizing it. The Jupyter Notebook containing this example can be found at this site’s Github repository. Note this example is loosely based on the TensorFlow tutorial found here. In this example, we will be estimating linear regression parameters based on noisy data. These parameters can obviously be solved using analytical techniques, but that isn’t as interesting. First, we import some libraries and generate the noisy data:

```python
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pylab as plt

tfd = tfp.distributions

x_range = np.arange(0, 10, 0.1)
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + np.random.normal(0, 3.0, len(x_range)) + intercept
```

Plotting our noisy regression line looks like the following:

Next, let’s set up our little model to predict the underlying regression function from the noisy data:

```python
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1),
    tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x, scale=1)),
])
```

So here we have a simple Keras sequential model (for more detail on Keras and TensorFlow, see this post). The first layer is a Dense layer with one node. Since each Dense layer has a bias input by default, this layer equates to generating a simple line with a gradient and intercept: $xW + b$, where *x* is the input data, *W* is the single input weight and *b* is the bias weight. So the first Dense layer produces a line with a trainable gradient and y-intercept value.

The next layer is where TensorFlow Probability comes in. This layer allows you to create a parameterized probability distribution, with the parameter being “fed in” from the output of previous layers. In this case, you can observe that the lambda argument *x*, which is the output from the previous layer, defines the mean of a Normal distribution. The scale (i.e. the standard deviation) is fixed to 1.0. So, using TensorFlow Probability, our model no longer predicts just a single value for each input (as in a non-probabilistic neural network) – instead, the output is actually a Normal distribution. In that case, to actually predict values we need to call statistical functions on the output of the model. For instance:

- model(np.array([[1.0]])).sample(10) will produce a random sample of 10 outputs from the Normal distribution, parameterized by the input value 1.0 fed through the first Dense layer
- model(np.array([[1.0]])).mean() will produce the mean of the distribution, given the input
- model(np.array([[1.0]])).stddev() will produce the standard deviation of the distribution, given the input

and so on. We can also calculate the log probability of the output distribution, as will be discussed shortly. Next, we need to set up our “loss” function – in this case, our “loss” function is actually just the negative log likelihood (NLL):

```python
def neg_log_likelihood(y_actual, y_predict):
    return -y_predict.log_prob(y_actual)
```

In the above, the *y_actual* values are the actual noisy training samples. The *y_predict* values are actually a tensor of parameterized Normal probability distributions – one for each different training input. So, for instance, if one training input is 5.0, the corresponding *y_predict* value will be a Normal distribution with a mean value of, say, 12. Another training input may have a value of 10.0, and the corresponding *y_predict* will be a Normal distribution with a mean value of, say, 20, and so on. Therefore, for each *y_predict* and *y_actual* pair, it is possible to calculate the log probability of that actual value occurring, given the predicted Normal distribution.

To make this more concrete – let’s say for a training input value 5.0, the corresponding actual noisy regression value is 8.0. However, let’s say the predicted Normal distribution has a mean of 10.0 (and a fixed variance of 1.0). Using the formula for the log probability / log likelihood of a Normal distribution:

$$\ell_x(\mu,\sigma^2) = -\ln \sigma - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{x-\mu}{\sigma} \Big)^2$$

Substituting in the example values mentioned above:

$$\ell_x(10.0,1.0) = -\ln 1.0 - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{8.0-10.0}{1.0} \Big)^2$$

We can calculate the log likelihood from the *y_predict* distribution and the *y_actual* values. Of course, TensorFlow Probability does this for us by calling the *log_prob *method on the *y_predict* distribution. Taking the negative of this calculation, as I have done in the function above, gives us the negative log likelihood value that we need to minimize to perform MLE.
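Evaluating this expression numerically (a hand sketch of the same calculation that *log_prob* performs for these example values):

```python
import numpy as np

def normal_log_prob(x, mu, sigma):
    # log likelihood of a single observation under Normal(mu, sigma)
    return (-np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - 0.5 * ((x - mu) / sigma) ** 2)

log_p = normal_log_prob(8.0, 10.0, 1.0)   # approximately -2.919
```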

After the loss function, it is now time to compile the model, train it, and make some predictions:

```python
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.05),
              loss=neg_log_likelihood)
model.fit(x_range, lin_reg, epochs=500, verbose=False)
yhat = model(x_range)
mean = yhat.mean()
```

As can be observed, the model is compiled using our custom *neg_log_likelihood* function as the loss. Because this is just a toy example, I am using the full dataset as both the train and test set. The estimated regression line is simply the mean of all the predicted distributions, and plotting it produces the following:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

Another, more interesting, example is to use the model to predict not only the mean but also the changing variance of a dataset. In this example, the dataset consists of the same trend but the noise variance increases along with the *x* values:

def noise(x, grad=0.5, const=2.0):
    return np.random.normal(0, grad * x + const)

x_range = np.arange(0, 10, 0.1)
noise = np.array(list(map(noise, x_range)))
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + intercept + noise
plt.scatter(x_range, lin_reg)
plt.show()

The new model looks like the following:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2),
    tfp.layers.DistributionLambda(
        lambda x: tfd.Normal(loc=x[:, 0],
                             scale=1e-3 + tf.math.softplus(0.3 * x[:, 1]))),
])

In this case, we have two nodes in the first layer, ostensibly to predict both the mean and standard deviation of the Normal distribution, instead of just the mean as in the last example. The mean of the distribution is assigned to the output of the first node (x[:, 0]) and the standard deviation / scale is set to be equal to a softplus function based on the output of the second node (x[:, 1]). After training this model on the same data and using the same loss as the previous example, we can extract both the mean and standard deviation of the predicted distributions like so:

mean = yhat.mean()
upper = mean + 2 * yhat.stddev()
lower = mean - 2 * yhat.stddev()

In this case, the upper and lower variables are the 2-standard deviation upper and lower bounds of the predicted distributions. Plotting this produces:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.fill_between(x_range, lower, upper, alpha=0.1)
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

As can be observed, the model is successfully predicting the increasing variance of the dataset, along with the mean of the trend. This is a limited example of the power of TensorFlow Probability, but in future posts I plan to show how to develop more complicated applications like Bayesian Neural Networks. I hope this post has been useful for you in getting up to speed in topics such as conditional probability, Bayes Theorem, the prior, posterior and likelihood function, maximum likelihood estimation and a quick introduction to TensorFlow Probability. Look out for future posts expanding on the increasingly important probabilistic side of machine learning.

This post will review the REINFORCE or Monte-Carlo version of the Policy Gradient methodology. This methodology will be used in the OpenAI Gym Cartpole environment. All code used and explained in this post can be found on this site’s Github repository.

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[\sum_{t=0}^{T-1} \gamma^t r_t \right]$$

The function above means that we are attempting to find a policy ($\pi$) with parameters ($\theta$) which maximises the expected value of the sum of the discounted rewards of an agent in an environment. Therefore, we need to find a way of varying the parameters of the policy $\theta$ such that the expected value of the discounted rewards is maximised. In the deep reinforcement learning case, the parameters $\theta$ are the parameters of the neural network.
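To make the objective concrete, the quantity inside the expectation – the discounted reward sum for a single trajectory – can be computed as in this small sketch (the function name is illustrative only):

```python
def discounted_return(rewards, gamma=0.99):
    # sum over t of gamma^t * r_t for one sampled trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```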

Note the difference to the deep Q learning case – in deep Q based learning, the parameters we are trying to find are those that minimise the difference between the actual Q values (drawn from experiences) and the Q values predicted by the network. However, in Policy Gradient methods, the neural network directly determines the actions of the agent – usually by using a softmax output and sampling from this.

Also note that, because environments are usually non-deterministic, under any given policy ($\pi_\theta$) we are not always going to get the same reward. Rather, we are going to be sampling from some probability function as the agent operates in the environment, and therefore we are trying to maximise the *expected *sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.

Ok, so we want to learn the optimal $\theta$. The way we generally learn parameters in deep learning is by performing some sort of gradient based search of $\theta$. So we want to iteratively execute the following:

$$\theta \leftarrow \theta + \alpha \nabla J(\theta)$$

So the question is, how do we find $\nabla J(\theta)$?

First, let’s make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in *x* multiplied by the probability of *x*, like so:

$$\mathbb{E}[f(x)] = \sum_x P(x)f(x)$$
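For a discrete random variable, this definition can be sketched directly (illustrative names only):

```python
def expectation(pmf, f):
    # E[f(X)] = sum over x of P(x) * f(x) for a discrete distribution
    return sum(p * f(x) for x, p in pmf.items())

expectation({1: 0.25, 2: 0.75}, lambda x: x)  # 0.25*1 + 0.75*2 = 1.75
```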

Ok, so what does the cashing out of the expectation in $J(\theta)$ look like? First, we have to define the function which produces the rewards, i.e. the rewards equivalent of $f(x)$ above. Let’s call this $R(\tau)$ (where $R(\tau) = \sum_{t=0}^{T-1}r_t$, ignoring discounting for the moment). The value $\tau$ is the *trajectory* of the agent “moving” through the environment. It can be defined as:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$

The trajectory, as can be seen, is the progress of the agent through an episode of a game of length *T*. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(\tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(\tau)$ – but what does this actually look like in a reinforcement learning environment?

It consists of the two components – the probabilistic policy function which yields an action $a_t$ from states $s_t$ with a certain probability, and a probability that state $s_{t+1}$ will result from taking action $a_t$ from state $s_t$. The latter probabilistic component is uncertain due to the random nature of many environments. These two components operating together will “roll out” the trajectory of the agent $\tau$.

$P(\tau)$ looks like:

$$P(\tau) = \prod_{t=0}^{T-1} P_{\pi_{\theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)$$

If we take the first step, starting in state $s_0$ – our neural network will produce a softmax output with each action assigned a certain probability. The action is then selected by weighted random sampling subject to these probabilities – therefore, we have a probability of action $a_0$ being selected according to $P_{\pi_{\theta}}(a_t|s_t)$. This probability is determined by the policy $\pi$ which in turn is parameterised according to $\theta$ (i.e. a neural network with weights $\theta$). The next term will be $P(s_1|s_0,a_0)$ which expresses any non-determinism in the environment.

(Note: the vertical line in the probability functions above are conditional probabilities. $P_{\pi_{\theta}}(a_t|s_t)$ refers to the probability of action $a_t$ being selected, *given* the agent is in state $s_t$).

These probabilities are multiplied out over all the steps in the episode of length *T* to produce the trajectory $\tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(\tau)$ and $R(\tau)$ into the original expectation and take the derivative to get to $\nabla J(\theta)$ which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two.
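In log space, this product of probabilities becomes a sum, which is numerically far better behaved – a small sketch, with hypothetical names and the per-step probabilities supplied as plain lists:

```python
import math

def trajectory_log_prob(policy_probs, transition_probs):
    # log P(tau) = sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t)]
    return (sum(math.log(p) for p in policy_probs) +
            sum(math.log(p) for p in transition_probs))

trajectory_log_prob([0.5, 0.5], [1.0, 1.0])  # log(0.25) in a deterministic environment
```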

First, let’s take the log derivative of $P(\tau)$ with respect to $\theta$ i.e. $\nabla_\theta$ and work out what we get:

$$\nabla_\theta \log P(\tau) = \nabla_\theta \log \left(\prod_{t=0}^{T-1} P_{\pi_{\theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)\right) $$

$$ =\nabla_\theta \left[\sum_{t=0}^{T-1} (\log P_{\pi_{\theta}}(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)) \right]$$

$$ =\nabla_\theta \sum_{t=0}^{T-1}\log P_{\pi_{\theta}}(a_t|s_t)$$

The reason we are taking the log will be made clear shortly. As can be observed, when the log is taken of the multiplicative operator ($\prod$) this is converted to a summation (as multiplying terms within a log function is equivalent to adding them separately). In the final line, it can be seen that taking the derivative with respect to the parameters ($\theta$) removes the dynamics of the environment ($\log P(s_{t+1}|s_t,a_t))$) as these are independent of the neural network parameters / $\theta$.

Let’s go back to our original expectation function, substituting in our new trajectory based functions, and apply the derivative (again ignoring discounting for simplicity):

$$J(\theta) = \mathbb{E}[R(\tau)]$$

$$ = \int P(\tau) R(\tau) \, d\tau$$

$$\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau) R(\tau) \, d\tau$$

So far so good. Now, we are going to utilise the following rule which is sometimes called the “log-derivative” trick:

$$\frac{\nabla_\theta p(X,\theta)}{p(X, \theta)} = \nabla_\theta \log p(X,\theta)$$

We can then apply the $\nabla_{\theta}$ operator within the integral, and cajole our equation so that we get the $\frac{\nabla_{\theta} P(\tau)}{P(\tau)}$ expression like so:

$$\nabla_\theta J(\theta)=\int P(\tau) \frac{\nabla_\theta P(\tau)}{P(\tau)} R(\tau) \, d\tau$$

Then, using the log-derivative trick and applying the definition of expectation, we arrive at:

$$\nabla_\theta J(\theta)=\mathbb{E}\left[R(\tau) \nabla_\theta \log P(\tau)\right]$$

We can then substitute our previous derivation of $\nabla_{\theta} \log P(\tau)$ into the above to arrive at:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[R(\tau) \nabla_\theta \sum_{t=0}^{T-1} \log P_{\pi_{\theta}}(a_t|s_t)\right]$$

This is now close to the point of being something we can work with in our learning algorithm. Let’s take it one step further by recognising that, during our learning process, we are randomly sampling trajectories from the environment, and hoping to make informed training steps. Therefore, we can recognise that, to maximise the expectation above, we need to maximise it with respect to its argument i.e. we maximise:

$$\nabla_\theta J(\theta) \sim R(\tau) \nabla_\theta \sum_{t=0}^{T-1} \log P_{\pi_{\theta}}(a_t|s_t)$$

Recall that $R(\tau)$ is equal to $R(\tau) = \sum_{t=0}^{T-1}r_t$ (ignoring discounting). Therefore, we have two summations that need to be multiplied out, element by element. It turns out that after doing this (and reintroducing the discounting), we arrive at an expression like so:

$$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log P_{\pi_{\theta}}(a_t|s_t) \sum_{t'=t+1}^{T} \gamma^{t'-t-1} r_{t'}$$

As can be observed, there are two main components that need to be multiplied. However, one should note the differences in the bounds of the summation terms in the equation above – these will be explained in the next section.

The way we compute the gradient as expressed above in the REINFORCE method of the Policy Gradient algorithm involves sampling trajectories through the environment to estimate the expectation, as discussed previously. This REINFORCE method is therefore a kind of Monte-Carlo algorithm. Let’s consider this a bit more concretely.

Let’s say we initialise the agent and let it play a trajectory $\tau$ through the environment. The actions of the agent will be selected by performing weighted sampling from the softmax output of the neural network – in other words, we’ll be sampling the action according to $P_{\pi_{\theta}}(a_t|s_t)$. At each step in the trajectory, we can easily calculate $\log P_{\pi_{\theta}}(a_t|s_t)$ by simply taking the *log* of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial state $s_0$ as input, and it would produce a vector of pseudo-probabilities over the actions, generated by the softmax operation in the final layer, from which $a_0$ is sampled.
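A sketch of this sampling step, assuming *softmax_out* is a NumPy array of the network’s output probabilities (the function name is illustrative only):

```python
import numpy as np

def sample_action_and_log_prob(softmax_out, rng=np.random.default_rng(42)):
    # sample a_t ~ pi(a|s_t), and record log pi(a_t|s_t) for the gradient term
    action = rng.choice(len(softmax_out), p=softmax_out)
    return action, np.log(softmax_out[action])
```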

What about the second part of the $\nabla_\theta J(\theta)$ equation – $\sum_{t’= t + 1}^{T} \gamma^{t’-t-1} r_{t’}$? We can see that, for the first step ($t = 0$), the summation starts at $t’ = t + 1 = 1$. The summation then goes from $t’ = 1$ to the total length of the trajectory $T$ – in other words, over the rewards of the *rest of the episode*. Let’s say the episode was 4 states long – this term would then look like $\gamma^0 r_1 + \gamma^1 r_2 + \gamma^2 r_3$, where $\gamma$ is the discounting factor and is < 1.

Straight-forward enough. However, you may have realised that, in order to calculate the gradient $\nabla_\theta J(\theta)$ at the first step in the trajectory/episode, we need to know the reward values of *all subsequent states* in the episode. Therefore, in order to execute this method of learning, we can only take gradient learning steps after the *full episode* has been played to completion. Only after the episode is complete can we perform the training step.
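Once the episode is complete, these reward sums can be computed for every step in a single backward pass – a minimal sketch (the same idea appears in the *update_network* function later in the code):

```python
def rewards_to_go(rewards, gamma=0.95):
    # accumulate discounted reward sums backwards through the episode
    out = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out

rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```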

We are almost ready to move onto the code part of this tutorial. However, this is a good place for a quick discussion about how we would actually implement the calculations $\nabla_\theta J(\theta)$ equation in TensorFlow 2 / Keras. It turns out we can just use the standard cross entropy loss function to execute these calculations. Recall that cross entropy is defined as (for a deeper explanation of entropy, cross entropy, information and KL divergence, see this post):

$$CE = -\sum_x p(x) \log(q(x))$$

Which is just the summation between one function $p(x)$ multiplied by the log of another function $q(x)$ over the possible values of the argument *x*. If we look at the source code of the Keras implementation of cross-entropy, we can see the following:
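The core of that calculation can be sketched in NumPy as follows – a simplified sketch of the clipped $-\sum target \cdot \log(output)$ computation, not the exact Keras source:

```python
import numpy as np

def categorical_crossentropy_sketch(target, output, axis=-1):
    # clip to avoid log(0), then compute -sum(target * log(output))
    eps = 1e-7
    output = np.clip(output, eps, 1.0 - eps)
    return -np.sum(target * np.log(output), axis=axis)
```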

The *output* tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of *output* is calculated in the above. The *target* value, for our purposes, can be all the discounted rewards calculated at each step in the trajectory, and will be of size (num_steps_in_episode, 1). The summation of the multiplication of these terms is then calculated (*reduce_sum*). Gradient based training in TensorFlow 2 is generally a minimisation of the loss function, however, we want to maximise the calculation as discussed above. The good thing is, the sign of the cross entropy calculation shown above is inverted – so we are good to go.

To call this training step utilising Keras, all we have to do is execute something like the following:

network.train_on_batch(states, discounted_rewards)

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through *network,* apply the softmax function, and this will become the *output* variable in the Keras source code snippet above. Likewise, *discounted_rewards* is the same as *target* in the source code snippet above.

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{\pi_{\theta}}(a_t|s_t)$ with the state as the input:

GAMMA = 0.95
env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam())

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities to approximate $P_{\pi_{\theta}}(a_t|s_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser.

The next part of the code chooses the action from the output of the model:

def get_action(network, state, num_actions):
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
    return selected_action

As can be seen, first the softmax output is extracted from the network by inputting the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values.

The next function is the main function involved in executing the training step:

def update_network(network, rewards, states, actions, num_actions):
    reward_sum = 0
    discounted_rewards = []
    for reward in rewards[::-1]:  # reverse buffer r
        reward_sum = reward + GAMMA * reward_sum
        discounted_rewards.append(reward_sum)
    discounted_rewards.reverse()
    discounted_rewards = np.array(discounted_rewards)
    # standardise the rewards
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    states = np.vstack(states)
    loss = network.train_on_batch(states, discounted_rewards)
    return loss

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from t + 1 to T according to $\sum_{t’= t + 1}^{T} \gamma^{t’-t-1} r_{t’}$. The input argument *rewards* is a list of all the rewards achieved at each step in the episode. The *rewards[::-1]* operation reverses the order of the rewards list, so the first run through the *for* loop will deal with the last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the *for* loop is executed. Let’s say that the episode length is equal to 4 – $r_3$ will refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + \gamma r_3$, $r_1 + \gamma r_2 + \gamma^2 r_3$, $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3$]

This list is in reverse order relative to the actual state value list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the *for* loop reverses the list (*discounted_rewards.reverse()*).

Next, the list is converted into a numpy array, and the rewards are normalised to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras *train_on_batch* function, which was detailed earlier.

The next part of the code is the main episode and training loop:

num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")

for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    states = []
    actions = []
    while True:
        action = get_action(network, state, num_actions)
        new_state, reward, done, _ = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        if done:
            loss = update_network(network, rewards, states, actions, num_actions)
            tot_reward = sum(rewards)
            print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', tot_reward, step=episode)
                tf.summary.scalar('avg loss', loss, step=episode)
            break
        state = new_state

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the *done* flag is returned from the environment signifying that the episode is complete. At the end of the episode, the training step is performed on the *network *by running *update_network*. Finally, the rewards and loss are logged in the *train_writer* for viewing in TensorBoard.

The training results can be observed below:

As can be observed, the rewards steadily progress until they “top out” at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the user can verify that repeated runs of this version of Policy Gradient training have a high variance in their outcomes. Therefore, improvements in the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.


Standard versions of experience replay in deep Q learning consist of storing experience-tuples of the agent as it interacts with its environment. These tuples generally include the state, the action the agent performed, the reward the agent received and the subsequent state. These tuples are generally stored in some kind of experience buffer of a certain finite capacity. During the training of the deep Q network, batches of prior experience are extracted from this memory. Importantly, the samples in these training batches are extracted randomly and *uniformly* across the experience history.

Prioritised experience replay is an optimisation of this method. The intuition behind prioritised experience replay is that every experience is not equal when it comes to productive and efficient learning of the deep Q network. Consider a past experience in a game where the network already accurately predicts the Q value for that action. This experience sample, passed through the training process, will yield little in the way of improvements of the predictive capacity of the network. However, another experience sample may result in a poor estimation of the actual Q value at this state – this signifies that there is something valuable to learn about the experience and the algorithm should be encouraged to sample it.

What should the measure be to “rank” the sampling used in Prioritised Experience Replay? The most obvious answer is the difference between the predicted Q value, and what the Q value *should be* in that state and for that action. This difference is called the TD error, and looks like this (for a Double Q type network, see this post for more details):

$$\delta_i = r_{t} + \gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; \theta_t); \theta^-_t) - Q(s_{t}, a_{t}; \theta_t)$$

Here the left hand part of the equation is what the Q value should be (the target value): $r_{t} + \gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; \theta_t); \theta^-_t)$. The right hand part of the equation is what the Double Q network is actually predicting at the present time: $Q(s_{t}, a_{t}; \theta_t)$. The difference between these two quantities ($\delta_i$) is the “measure” of how much the network can learn from the given experience sample *i.* The higher the value, the more often this sample should be chosen.

Note, the notation above for the Double Q TD error features the $\theta_t$ and $\theta^-_t$ values – these are the weights corresponding to the primary and target networks, respectively. The *primary network* should be used to produce the right hand side of the equation above (i.e. $Q(s_{t}, a_{t}; \theta_t)$).
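A sketch of this TD error calculation, with the Q values passed in as arrays (hypothetical function and argument names):

```python
import numpy as np

def double_q_td_error(reward, q_next_primary, q_next_target, q_current, gamma=0.99):
    # delta = r + gamma * Q_target(s', argmax_a Q_primary(s', a)) - Q_primary(s, a)
    best_action = int(np.argmax(q_next_primary))
    return reward + gamma * q_next_target[best_action] - q_current
```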

Often, to reduce the variance of $\delta_i$, the Huber loss function is used on this TD error. The Huber loss function will be used in the implementation below.

A common way of setting the priorities of the experience samples is by adding a small constant to the TD error term like so:

$$p_i = | \delta_i | + \epsilon$$

This ensures that, even with samples which have a low $\delta_i$, they still have a small chance of being selected for sampling. Using these priorities, the discrete probability of drawing sample/experience *i* under Prioritised Experience Replay is:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

Notice the $\alpha$ factor – this is a way of scaling the prioritisation based on the TD error up or down. If $\alpha = 0$ then all of the $p_i^{\alpha}$ terms go to 1 and every experience has the same chance of being selected, regardless of the TD error. Alternatively, if $\alpha = 1$ then “full prioritisation” occurs i.e. every sample is randomly selected proportional to its TD error (plus the constant). A commonly used $\alpha$ value is 0.6 – so that prioritisation occurs but it is not absolute prioritisation. This promotes some exploration in addition to the PER process.
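These two equations can be sketched together as follows (illustrative names only):

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=0.01):
    # p_i = (|delta_i| + eps)^alpha, normalised so the probabilities sum to 1
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()
```

Note that even a sample with zero TD error retains a small, non-zero probability of selection.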

Another aspect of Prioritised Experience Replay is a concept called Importance Sampling (IS). This part of Prioritised Experience Replay requires a bit of unpacking, for it is not intuitively obvious why it is required. When we are performing some Q based learning, we are trying to minimise the TD error by changing the model parameters $\theta$. However, we don’t have an exact function for the TD error based on all the possible states, actions and rewards in an environment. Instead, what we have are samples from the environment in the form of these experience tuples (states, actions, rewards).

Therefore, what we are really trying to minimise the *expected value* of the TD error, based on these samples. However, by drawing experience tuples based on the prioritisation discussed above, we are *skewing or biasing* this expected value calculation. This will lead to the result of us not actually solving the problem we are supposed to be solving during the training of the network.

A solution to this problem is to use something called *importance sampling*. There is more to IS, but, in this case, it is about applying weights to the TD error to try to correct the aforementioned bias. What does this look like? Here is an expression of the weights which will be applied to the loss values during training:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$

This weight value will be multiplied by the TD error ($\delta_i$), which has the same effect as reducing the gradient step during training. These weights can “slow down” the learning of certain experience samples with respect to others. The variable *N* refers to the number of experience tuples already stored in your memory (and will top-out at the size of your memory buffer once it’s full). By looking at the equation, you can observe that the higher the probability of the sampling, the lower this weight value will be.

Because experience samples with a high priority / probability will be sampled more frequently under PER, this weight value ensures that the learning is slowed for these samples. This ensures that the training is not “overwhelmed” by the frequent sampling of these higher priority / probability samples and therefore acts to correct the aforementioned bias.

The $\beta$ value is generally initialised between 0.4 and 0.6 at the start of training and is annealed towards 1 at the end of the training. Because the value within the bracket is always < 1, a $\beta$ of < 1 will actually increase the weight values towards 1 and reduce the effect of these weight values. The authors of the original paper argue that at the beginning of the training, the learning is chaotic and the bias caused by the prioritisation doesn’t matter much anyway. It is only towards the end of the training that this bias needs to be corrected, so the $\beta$ value being closer to 1 decreases the weights for high priority / probability samples and therefore corrects the bias more. Note that in practice these weights $w_i$ in each training batch are rescaled so that they range between 0 and 1.
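A sketch of the weight calculation, including the rescaling mentioned above (hypothetical names):

```python
import numpy as np

def importance_sampling_weights(probs, beta=0.4):
    # w_i = (1/N * 1/P(i))^beta, rescaled so the maximum weight is 1
    n = len(probs)
    w = (1.0 / (n * probs)) ** beta
    return w / w.max()
```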

In order to sample experiences according to the prioritisation values, we need some way of organising our memory buffer so that this sampling is efficient. One feasible way of sampling is to create a cumulative sum of all the prioritisation values, and then sample from a uniform distribution of interval (0, max(cumulative_prioritisation)). This will result in sampling which is appropriately weighted according to the prioritisation values i.e. according to $P(i)$. However, this method of sampling requires an iterative search through the cumulative sum until the random value is greater than the cumulative value – this will then be the selected sample.

This is fine for small-medium sized datasets, however for very large datasets such as the memory buffer in deep Q learning (which can be millions of entries long), this is an inefficient way of selecting prioritised samples. The alternative to this method of sampling is the SumTree data structure and algorithms. The SumTree structure won’t be reviewed in this post, but the reader can look at my comprehensive post here on how it works and how to build such a structure.
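The linear cumulative-sum scheme can be sketched as follows (for illustration only – the SumTree replaces this in practice):

```python
import numpy as np

def sample_index_by_priority(priorities, rng=np.random.default_rng(0)):
    # draw u ~ Uniform(0, total priority) and find the first index whose
    # cumulative sum reaches u -- an O(N) weighted sample
    cumulative = np.cumsum(priorities)
    u = rng.uniform(0.0, cumulative[-1])
    return int(np.searchsorted(cumulative, u))
```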

That concludes the theory component of Prioritised Experience Replay and now we can move onto what the code looks like.

The code below will demonstrate how to implement Prioritised Experience Replay in TensorFlow 2. The code for this example can be found on this site’s Github repo. Please note that this code will heavily utilise code from one of my previous tutorials on Dueling Q learning in Atari environments and common code components will not be explained in detail in this post. The reader can go back to that post if they wish to review the intricacies of Dueling Q learning and using it in the Atari environment. This example will be demonstrated in the Space Invaders Atari OpenAI environment. The first part of the code can be observed below:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.1
EPSILON_MIN_ITER = 500000
GAMMA = 0.99
BATCH_SIZE = 32
TAU = 0.08
POST_PROCESS_IMAGE_SIZE = (105, 80, 1)
DELAY_TRAINING = 50000
BETA_DECAY_ITERS = 500000
MIN_BETA = 0.4
MAX_BETA = 1.0
NUM_FRAMES = 4
GIF_RECORDING_FREQ = 100
MODEL_SAVE_FREQ = 100

env = gym.make("SpaceInvaders-v0")
num_actions = env.action_space.n

def huber_loss(loss):
    return 0.5 * loss ** 2 if abs(loss) < 1.0 else abs(loss) - 0.5

class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                            kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                              kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss=tf.keras.losses.Huber())
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

First, the reader can see that various constants are declared. Following this, a custom Huber loss function is declared; this will be used later in the code. Next, a custom Keras model is created which instantiates a Dueling Q architecture – again, refer to my previous post for more details. Finally, a primary and a target network are created to perform Double Q learning, and the target network weights are set equal to the primary network weights.
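As a quick numerical check, the custom Huber loss above is quadratic for errors inside the unit interval and linear outside it:

```python
def huber_loss(loss):
    return 0.5 * loss ** 2 if abs(loss) < 1.0 else abs(loss) - 0.5

print(huber_loss(0.5))   # quadratic region: 0.5 * 0.5**2 = 0.125
print(huber_loss(2.0))   # linear region: 2.0 - 0.5 = 1.5
print(huber_loss(-2.0))  # symmetric in the error: 1.5
```

The linear region prevents very large TD errors from producing correspondingly large priorities and gradients.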

After this declaration, the SumTree data structures and functions are developed. For more details, as stated previously, see my SumTree post.

Next, the Memory class is created:

class Memory(object):
    def __init__(self, size: int):
        self.size = size
        self.curr_write_idx = 0
        self.available_samples = 0
        self.buffer = [(np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1]),
                                 dtype=np.float32), 0.0, 0.0, 0.0) for i in range(self.size)]
        self.base_node, self.leaf_nodes = create_tree([0 for i in range(self.size)])
        self.frame_idx = 0
        self.action_idx = 1
        self.reward_idx = 2
        self.terminal_idx = 3
        self.beta = 0.4
        self.alpha = 0.6
        self.min_priority = 0.01

The input argument to the class is the size of the memory buffer (i.e. how many experience tuples it will hold). The curr_write_idx variable designates the current position in the buffer at which to place new experience tuples. If this value exceeds the size of the buffer, it is reset back to the beginning of the buffer (index 0). The available_samples variable is a measure of how many samples have been placed in the buffer. Once the buffer has been filled for the first time, this variable is capped at the memory *size*.

Next, the experience buffer is initialized with zeros. It is important that you initialize this buffer at the beginning of the training, as you will be able to instantly determine whether your machine has enough memory to handle the size of this buffer. If you don’t initialize but dynamically append to a list, you run the risk of exceeding the memory of your machine half way through your training – which can be frustrating during long training runs!

The next line involves the creation of the SumTree object. The SumTree is initialised with the number of leaf nodes equal to the size of the buffer, and with a value of 0. Again, for more details on the SumTree object, see this post. The following variable declarations (frame_idx, action_idx, reward_idx and terminal_idx) specify what tuple indices relate to each of the variable types stored in the buffer. Finally, the IS $\beta$ and the PER $\alpha$ values are initialised with values previously discussed above, and the minimum priority value to add to each experience tuple is defined as some small float value.

The next method in the Memory class appends a new experience tuple to the buffer and also updates the priority value in the SumTree:

def append(self, experience: tuple, priority: float):
    self.buffer[self.curr_write_idx] = experience
    self.update(self.curr_write_idx, priority)
    self.curr_write_idx += 1
    # reset the current writer position index if greater than the allowed size
    if self.curr_write_idx >= self.size:
        self.curr_write_idx = 0
    # max out available samples at the memory buffer size
    if self.available_samples + 1 < self.size:
        self.available_samples += 1
    else:
        self.available_samples = self.size - 1

def update(self, idx: int, priority: float):
    update(self.leaf_nodes[idx], self.adjust_priority(priority))

def adjust_priority(self, priority: float):
    return np.power(priority + self.min_priority, self.alpha)

Here you can observe that both the experience tuple (state, action, reward, terminal) and the priority of this experience are passed to this method. The experience tuple is written to the buffer at *curr_write_idx* and the priority is sent to the *update* method of the class. The *update* method of the Memory class in turn calls the SumTree update function which sits outside this class. Notice that the “raw” priority is not passed directly to the SumTree update; rather, it is first passed through the *adjust_priority* method. This method adds the minimum priority factor and then raises the priority to the power of $\alpha$, i.e. it performs the priority calculation from the following expressions:

$$p_i = | \delta_i | + \epsilon$$

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
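As a quick numerical check, the priority adjustment (the $p_i^\alpha$ part of the expressions above) can be sketched standalone, using the same constants initialised in the Memory class:

```python
import numpy as np

ALPHA = 0.6          # PER alpha, as initialised in the Memory class
MIN_PRIORITY = 0.01  # the small constant added to every priority

def adjust_priority(raw_priority: float) -> float:
    # p_i = (|delta_i| + epsilon) ^ alpha
    return float(np.power(raw_priority + MIN_PRIORITY, ALPHA))

# the alpha exponent (< 1) compresses the range of priorities, so large
# TD errors do not completely dominate the sampling distribution
```

The normalisation by $\sum_k p_k^\alpha$ happens implicitly when sampling from the SumTree, since the root node holds the sum of all adjusted priorities.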

After the experience tuple is added, the current write index is incremented. If the current write index now exceeds the size of the buffer, it is reset back to 0 to start overwriting old experience tuples. Next, the *available_samples* value is incremented, but only if it is less than the size of the memory, otherwise it is clipped at the size of the memory. Next is the (rather complicated) sample method:

def sample(self, num_samples: int):
    sampled_idxs = []
    is_weights = []
    sample_no = 0
    while sample_no < num_samples:
        sample_val = np.random.uniform(0, self.base_node.value)
        samp_node = retrieve(sample_val, self.base_node)
        if NUM_FRAMES - 1 < samp_node.idx < self.available_samples - 1:
            sampled_idxs.append(samp_node.idx)
            p = samp_node.value / self.base_node.value
            is_weights.append((self.available_samples + 1) * p)
            sample_no += 1
    # apply the beta factor and normalise so that the maximum is_weight < 1
    is_weights = np.array(is_weights)
    is_weights = np.power(is_weights, -self.beta)
    is_weights = is_weights / np.max(is_weights)
    # now load up the state and next state variables according to sampled idxs
    states = np.zeros((num_samples, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1],
                       NUM_FRAMES), dtype=np.float32)
    next_states = np.zeros((num_samples, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1],
                            NUM_FRAMES), dtype=np.float32)
    actions, rewards, terminal = [], [], []
    for i, idx in enumerate(sampled_idxs):
        for j in range(NUM_FRAMES):
            states[i, :, :, j] = self.buffer[idx + j - NUM_FRAMES + 1][self.frame_idx][:, :, 0]
            next_states[i, :, :, j] = self.buffer[idx + j - NUM_FRAMES + 2][self.frame_idx][:, :, 0]
        actions.append(self.buffer[idx][self.action_idx])
        rewards.append(self.buffer[idx][self.reward_idx])
        terminal.append(self.buffer[idx][self.terminal_idx])
    return states, np.array(actions), np.array(rewards), next_states, np.array(terminal), \
        sampled_idxs, is_weights

The purpose of this method is to perform priority sampling of the experience buffer, but also to calculate the importance sampling weights for use in the training steps. The first step is a while loop which iterates until *num_samples* have been sampled. This sampling is performed by selecting a uniform random number between 0 and the base node value of the SumTree. This sample value is then retrieved from the SumTree data structure according to the stored priorities.

A check is then made to ensure that the sampled index is valid and if so it is appended to a list of sampled indices. After this appending, the $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ value is calculated. The SumTree base node value is actually the sum of all priorities of samples stored to date. Also recall that the $\alpha$ value has already been applied to all samples as the “raw” priorities are added to the SumTree.

Now, the IS weights are calculated according to the following:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$

This can alternatively be expressed as:

$$w_i = \left( N \cdot P(i) \right)^{-\beta}$$

On the next line of the code, the value $N \cdot P(i)$ is appended to the *is_weights* list. Following the accumulation of the samples, the IS weights are converted from a list to a numpy array, and each value is raised element-wise to the power of $-\beta$. The IS weights are then normalised so that they span between 0 and 1, which acts to stabilise learning.
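The weight calculation and normalisation can be sketched in isolation with numpy (the $P(i)$ values below are hypothetical):

```python
import numpy as np

N = 1000                               # available samples in the buffer
probs = np.array([0.05, 0.001, 0.02])  # hypothetical P(i) for three samples
beta = 0.4

# w_i = (N * P(i)) ^ (-beta), then normalise so the maximum weight is 1
is_weights = np.power(N * probs, -beta)
is_weights = is_weights / np.max(is_weights)

# the rarest sample (smallest P(i)) receives the largest weight, which
# corrects for the bias introduced by sampling it less often
```

Note that the sample with the smallest probability ends up with the weight of exactly 1 after normalisation, and more frequently sampled experiences are down-weighted relative to it.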

Next, the *states* and *next_states* arrays are initialised – in this case, these arrays will consist of 4 stacked frames of images for each training sample. For more explanation on training in an Atari environment with stacked frames – see this post. Finally, these frame / state arrays, the associated actions, rewards and terminal flags, the sampled indices and the IS weights are returned from the method.

That concludes the explanation of the rather complicated Memory class.

Next we initialise the Memory class and declare a number of other ancillary functions (which have already been discussed here).

memory = Memory(200000)


def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image


def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                                POST_PROCESS_IMAGE_SIZE[1],
                                                                NUM_FRAMES)).numpy()))


def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)


def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack


def record_gif(frame_list, episode, reward, fps=50):
    imageio.mimsave(STORE_PATH + "\\SPACE_INVADERS_EPISODE-eps{}-r{}.gif".format(episode, reward),
                    frame_list, fps=fps)

The next function calculates the target Q values for training (see this post for details on Double Q learning) and also calculates the $\delta(i)$ priority values for each sample:

def get_per_error(states, actions, rewards, next_states, terminal, primary_network, target_network):
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one
    # index corresponding to the max action
    target_q = prim_qt.numpy()
    # the action selection from the primary / online network
    prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
    # the q value for the prim_action_tp1 from the target network
    q_from_target = target_network(next_states)
    batch_idxs = np.arange(states.shape[0])
    updates = rewards + (1 - terminal) * GAMMA * q_from_target.numpy()[batch_idxs, prim_action_tp1]
    target_q[batch_idxs, actions] = updates
    # calculate the loss / error to update priorities
    error = [huber_loss(target_q[i, actions[i]] - prim_qt.numpy()[i, actions[i]])
             for i in range(states.shape[0])]
    return target_q, error

The first part of the function, and how it works to estimate the target Q values, has been discussed in previous posts (see here). The new feature in this Prioritised Experience Replay example is the calculation of *error*. The value that is calculated on this line is the TD error, but the TD error passed through a Huber loss function:

$$\delta_i = r_{t+1} + \gamma Q(s_{t+1}, \underset{a}{\operatorname{argmax}}\, Q(s_{t+1}, a; \theta_t); \theta^-_t) - Q(s_{t}, a_{t}; \theta_t)$$

Which is the same as:

$$\delta_i = Q_{target} - Q(s_{t}, a_{t}; \theta_t)$$

Note that $Q(s_{t}, a_{t}; \theta_t)$ is extracted from the primary network (with weights of $\theta_t$). Both the target Q values and the Huber error / $\delta_i$ are returned from this function.

The next function uses the *get_per_error *function just reviewed, updates the priority values for these samples in the memory, and also trains the primary network:

def train(primary_network, memory, target_network):
    states, actions, rewards, next_states, terminal, idxs, is_weights = memory.sample(BATCH_SIZE)
    target_q, error = get_per_error(states, actions, rewards, next_states, terminal,
                                    primary_network, target_network)
    for i in range(len(idxs)):
        memory.update(idxs[i], error[i])
    loss = primary_network.train_on_batch(states, target_q, is_weights)
    return loss

As can be observed, first a batch of samples is extracted from the memory. Next, the target_q values and the Huber loss TD errors are calculated. For each memory index, the error is passed to the Memory *update* method. Note that, every time a sample is drawn from memory and used to train the network, the new TD errors calculated in that process are passed back to the memory so that the priority of these samples is updated. This ensures that samples with TD errors which were once high (and were therefore valuable, as the network was not predicting them well) but are now low (due to network training) will no longer be sampled as frequently.

Finally, the primary network is trained on the batch of states and target Q values. Note that a third argument is passed to the Keras *train_on_batch* function – the importance sampling weights. The Keras *train_on_batch* function has an optional argument which applies a multiplicative weighting factor to each loss value – this is exactly what we need to apply the IS adjustment to the loss values.

The code following is the main training / episode loop:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + "/DuelingQPERSI_{}".format(
    dt.datetime.now().strftime('%d%m%Y%H%M')))
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape(
        (POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        old_state_stack = state_stack
        state_stack = process_state_stack(state_stack, next_state)
        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network)
            update_network(primary_network, target_network)
            _, error = get_per_error(tf.reshape(old_state_stack, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                                  POST_PROCESS_IMAGE_SIZE[1],
                                                                  NUM_FRAMES)),
                                     np.array([action]), np.array([reward]),
                                     tf.reshape(state_stack, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                              POST_PROCESS_IMAGE_SIZE[1],
                                                              NUM_FRAMES)),
                                     np.array([done]))
            # store in memory
            memory.append((next_state, action, reward, done), error[0])
        else:
            loss = -1
            # store in memory - default the priority to the reward
            memory.append((next_state, action, reward, done), reward)
        avg_loss += loss
        # linearly decay the eps and PER beta values
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                  MIN_EPSILON
            beta = MIN_BETA + ((steps - DELAY_TRAINING) / BETA_DECAY_ITERS) * \
                   (MAX_BETA - MIN_BETA) if steps < BETA_DECAY_ITERS else \
                   MAX_BETA
            memory.beta = beta
        steps += 1
        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print("Episode: {}, Reward: {}, avg loss: {:.5f}, eps: {:.3f}".format(
                    i, tot_reward, avg_loss, eps))
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print("Pre-training...Episode: {}".format(i))
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i, tot_reward)
            break
        cnt += 1

This training loop has been explained in detail here, so please refer to that post for a detailed explanation. The first main difference to note is the linear increment from MIN_BETA to MAX_BETA (0.4 to 1.0) over BETA_DECAY_ITERS training steps – the purpose of this change in the $\beta$ value has been explained previously. The next major difference results from the need to feed a priority value into memory along with the experience tuple during each episode step. This is calculated by calling the *get_per_error* function that was explained previously, and this error is passed to the memory *append* method. Before training of the network actually commences (i.e. while steps < DELAY_TRAINING), the reward is substituted for the priority in the memory.
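The linear $\beta$ annealing can be isolated as a small sketch using the constants defined at the top of the script:

```python
DELAY_TRAINING = 50000
BETA_DECAY_ITERS = 500000
MIN_BETA = 0.4
MAX_BETA = 1.0

def annealed_beta(steps: int) -> float:
    # linearly ramp beta from MIN_BETA towards MAX_BETA over BETA_DECAY_ITERS
    # steps, then hold it at MAX_BETA for the rest of training
    if steps < BETA_DECAY_ITERS:
        return MIN_BETA + ((steps - DELAY_TRAINING) / BETA_DECAY_ITERS) * (MAX_BETA - MIN_BETA)
    return MAX_BETA
```

At the end of the delay period the function returns 0.4, and once BETA_DECAY_ITERS steps have elapsed it saturates at 1.0, at which point the IS weights fully compensate for the non-uniform sampling.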

This concludes the explanation of the code for this Prioritised Experience Replay example. Now let’s look at the results.

The graph below shows the progress of the rewards over ~1000 episodes of training in the Open AI Space Invader environment, using Prioritised Experience Replay:

This concludes my post introducing the important Prioritised Experience Replay concept. In future posts, I’ll deal with other types of reinforcement learning algorithms.


Let’s say we have a tuple of entries in a list, something like this:

[(214, 1),
 (342, 4),
 (42, 2),
 (123, 3)]

The first element in each tuple is the value you want to sample, and the second element is the weighting value which governs the frequency at which each element is randomly sampled. In the example above, we would expect the 342 value to be sampled 4 times as frequently as the 214 value. How would we perform this weighted sampling? A straightforward way is to perform a cumulative sum, then sample based on a uniform probability distribution.

Here is the same collection but with a cumulative sum “column” added:

[(214, 1, 1),
 (342, 4, 5),
 (42, 2, 7),
 (123, 3, 10)]

To perform weighted sampling once the cumulative summing is done, one can draw from a uniform distribution with a minimum of 0 and a maximum of 10 (the highest value in the cumulative sum) and select the element whose cumulative interval contains the drawn value. For instance, if the random number extracted from the uniform distribution U(0, 10) is 0.5, then the first element in the list would be sampled (value = 214). If the random number were instead 7.8, the sampled value would be 123.

As can be observed, the element with weight = 4 takes up 40% of the total interval between 0 and 10, as opposed to the 10% taken up by the element with weight = 1. Therefore, sampling using this method will respect the weight values and proportion the sampling frequencies accordingly.
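The cumulative sum approach above can be sketched as follows (the helper name `weighted_sample` is just for illustration):

```python
import itertools
import random

entries = [(214, 1), (342, 4), (42, 2), (123, 3)]
values = [v for v, _ in entries]
# the cumulative sum "column": [1, 5, 7, 10]
cum_weights = list(itertools.accumulate(w for _, w in entries))

def weighted_sample(r=None):
    if r is None:
        r = random.uniform(0, cum_weights[-1])
    # linear scan for the first cumulative interval containing r - O(n) per sample
    for value, upper in zip(values, cum_weights):
        if r <= upper:
            return value

print(weighted_sample(0.5))  # falls in the (0, 1] interval -> 214
print(weighted_sample(7.8))  # falls in the (7, 10] interval -> 123
```

The fixed draws of 0.5 and 7.8 reproduce the walk-through above; the linear scan over the cumulative column is exactly the $O(n)$ cost that the SumTree reduces.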

So far so good. However, what happens if our collection is millions of entries long? This method has a time complexity of $O(n)$ during the sampling process – the time it takes to sample an entry is proportional to the number of elements in the collection. Therefore, for collections with millions of entries, the computational cost of this method, given frequent sampling, can be significant. The SumTree algorithm / data structure can do better – its time complexity is $O(\log n)$, which is significantly quicker.

The diagram below shows the SumTree for the collection shown above:

The first thing to note is that the “leaf” nodes of the tree (1, 4, 2, 3) correspond to the weights of the collection previously shown. The next thing to note is that the parent of each pair of leaf nodes has a value equal to the sum of its children. So, for instance, the parent node of the 1 and 4 value leaf nodes has a value of 5. Likewise for the other parent node, and the same summation occurs to produce the value at the top of the tree (10).

How does the data extraction work from such a tree? The top parent node in a SumTree has a value equal to the summation of *all* the leaf nodes of the tree. So the first step is to perform a uniform random sampling of a value between 0 and the value of the top parent (i.e. in this case U(0, 10)). Let’s say this sampled value is 3.5, and let’s assign it to a variable named *value*. The first retrieval step is to check whether *value* is less than the left-hand child node value. When this is the case, we keep *value* the same and traverse to the left-hand child. Next, we do the same comparison – is *value* less than the left-hand child node value (is *value* < 1)? In this case it isn’t, so we traverse to the right-hand node (4). Whenever a right-hand path is taken, *value* is adjusted by subtracting the node value of the left-hand path – so in this case, *value* = 3.5 – 1 = 2.5. Because the right-hand node (4) is a leaf node, the search terminates and returns this node / index. The diagram below shows this process:

The diagram below shows the traversal path through the SumTree for a random value of 6.5:

As can be observed initially the right-hand child of the top parent is selected, so the *value* is decremented by the left-hand node value (6.5 – 5). On the second level, the left-hand child is selected and the algorithm would return the (2) leaf node.

By considering how the SumTree algorithm works, it can be seen how much more efficient it is than iterating through a cumulative sum array until the correct interval is found. It is also quite easy to update the weight values of the leaf nodes and propagate the changes. All that needs to be done is to take the difference of the change and then add that difference to all upstream parent nodes. For instance, if the (2) value leaf node is increased to 5, the change is 3. Therefore, its parent (the right-hand 5 node) would be increased to 8, and this parent’s parent (the top parent node – 10) would also be increased by 3, to 13.
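The update propagation just described can be sketched with an array-backed tree – a common alternative layout to the node-object implementation used in this post (the helper `set_leaf` is illustrative only, assuming a complete binary tree stored in a flat array):

```python
import numpy as np

# array-backed SumTree with `capacity` leaves: internal nodes occupy indices
# 0..capacity-2 and leaves occupy indices capacity-1..2*capacity-2
capacity = 4
tree = np.zeros(2 * capacity - 1)

def set_leaf(leaf_idx: int, value: float):
    idx = leaf_idx + capacity - 1
    change = value - tree[idx]
    tree[idx] = value
    # add the difference to every upstream parent, up to the root
    while idx != 0:
        idx = (idx - 1) // 2
        tree[idx] += change

for i, w in enumerate([1, 4, 2, 3]):
    set_leaf(i, w)
print(tree[0])  # root value = total weight = 10.0

set_leaf(2, 5)  # increase the (2) leaf to 5, as in the example above
print(tree[0])  # root value is now 13.0
```

Increasing the (2) leaf to 5 bumps its parent from 5 to 8 and the root from 10 to 13, matching the worked example above.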

This is an efficient data structure and algorithm, so let’s see how to implement a SumTree in Python code.

There are many different ways of implementing a SumTree in Python. The code below uses a class to define the structure, and uses recursive functions to both traverse and create the SumTree. Below shows the base class for the SumTree:

class Node:
    def __init__(self, left, right, is_leaf: bool = False, idx=None):
        self.left = left
        self.right = right
        self.is_leaf = is_leaf
        if not self.is_leaf:
            self.value = self.left.value + self.right.value
        self.parent = None
        self.idx = idx  # this value is only set for leaf nodes
        if left is not None:
            left.parent = self
        if right is not None:
            right.parent = self

    @classmethod
    def create_leaf(cls, value, idx):
        leaf = cls(None, None, is_leaf=True, idx=idx)
        leaf.value = value
        return leaf

This class defines information about each node in the tree, but also contains the tree structure within the *left*, *right* and *parent* properties. The property *left* refers to the left-hand child of this node, and the property *right* refers to the right-hand child. The property *parent* refers, obviously, to the parent of the current node (if it has one). Each of these properties will point to another Node instance. Note that, for non-leaf nodes, the *value* property in the class initialization is set to the sum of the values of the left and right child nodes.

The final part of the initialization also sets the parent node of the *left* and *right* child nodes to be equal to the node itself. The *create_leaf* class method, which creates all the leaf nodes of the tree, takes a weight value and an index as arguments. In the first line, the leaf variable is defined as an instance of the Node class, with no children specified. The *is_leaf* flag of the Node is set to *True* – this will be used later during the leaf node retrieval function. Finally, the leaf weight is set to the passed *value* and the leaf node object is returned.

The function below shows how the tree can be created:

def create_tree(input: list):
    nodes = [Node.create_leaf(v, i) for i, v in enumerate(input)]
    leaf_nodes = nodes
    while len(nodes) > 1:
        inodes = iter(nodes)
        nodes = [Node(*pair) for pair in zip(inodes, inodes)]
    return nodes[0], leaf_nodes

The input to this function is a list containing all the weight values of the leaf nodes. For the simple example we have been working with, this list would be [1, 4, 2, 3]. The first line of the function creates a list of leaf Nodes by calling the *create_leaf* class method and supplying the passed list values and their respective indices. A *leaf_nodes* variable, which keeps track of which nodes are the leaves of the tree, is then created and set equal to this initial list of *nodes*.

Next, a more difficult to comprehend loop is entered into. On the first line of the loop, a Python iterator object is created from the *nodes* list. Note that when a Python iterator object is created from a list of objects like this, each time the iterator is called (via *next()* or some other operation which extracts the element in the list), that element or object is removed from the iterator. The next line creates a new *nodes* list of Node objects.

This line requires some unpacking. The zip function in the list comprehension zips together two references to the same iterator. When “for *pair*” operates on this zip, it first extracts one node from inodes, then extracts the *next* node from inodes, and zips them together. This essentially produces a tuple of (*node_1*, *node_2*) in the first round of the *for* loop, then (*node_3*, *node_4*) in the second round, and so on. These tuples are then used in the list comprehension to create a new instance of Node, i.e. the parent node of *node_1* and *node_2*.
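The iterator-pairing trick can be verified in isolation:

```python
nodes = [1, 2, 3, 4]
inodes = iter(nodes)
# both zip arguments are the SAME iterator, so each tuple consumes two
# consecutive elements from it
pairs = [pair for pair in zip(inodes, inodes)]
print(pairs)  # [(1, 2), (3, 4)]
```

Because both arguments to zip share the same iterator, each output tuple advances it twice, pairing consecutive elements.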

In this example, the first time through the *while len(nodes) > 1* loop would create a list of 2 nodes – the first node being the parent of leaf nodes (1) and (4), and the second node being the parent of (2) and (3). The length of the list *nodes* after this first pass would now be 2. On the second pass, a single Node in the list would be created which is the parent of the parents of the leaf nodes, i.e. the parent of the (5) and (5) nodes. This is obviously the top-parent node with a weight value of 10. Because the top-parent or root node has been reached, the length of *nodes* is no longer > 1 and therefore the while loop exits. The function returns the top-parent node (= *nodes[0]*) and the list of leaf nodes. Note, the above function and class definition borrows partially from one of the answers here.

The next function to review is the leaf node retrieval function:

def retrieve(value: float, node: Node):
    if node.is_leaf:
        return node
    if node.left.value >= value:
        return retrieve(value, node.left)
    else:
        return retrieve(value - node.left.value, node.right)

The retrieve function takes a value – in our examples a uniformly sampled random value – and the top parent node as arguments. This function is used in a recursive loop, so the first line tests whether the *node* is actually a leaf node, meaning the SumTree traversal has been completed. If this is the case, the function exits and returns the leaf node.

Next, the function checks to see whether the current *value* is less than or equal to the left-hand child node value (*node.left.value*). If so, the function recursively calls itself and passes the *value* straight through – in exactly the same manner as the walk-through example I presented above. It also passes the left child node as the second argument. If the *value* is greater than the left-hand child node value, the function instead recursively calls itself, passing the *value* minus the left-hand child node value, and the right-hand child node, as arguments.

If you follow the logic of this recursive loop, you will see that the traversal through the SumTree structure, from the top-parent node to the leaf node, works exactly as was described in the walk-through example above. The next two functions are involved in the updating process of the weights of the leaf nodes, the changes of which are then propagated up through the SumTree:

def update(node: Node, new_value: float):
    change = new_value - node.value
    node.value = new_value
    propagate_changes(change, node.parent)


def propagate_changes(change: float, node: Node):
    node.value += change
    if node.parent is not None:
        propagate_changes(change, node.parent)

In the first function the leaf node that needs to be changed is the first argument and the new weight value this leaf node should have is passed as the second argument. The first line of the function calculates the *change* from the current value. The next recursive function *propagate_changes* is then called – with the arguments being the *change* variable and the parent of the leaf node. In this function, first the parent node value is updated by *change*. Next, the function checks to see if this parent node *itself* has a parent node. If it doesn’t, that means that the top-parent node has been reached and all the changes have been propagated. If the current node *does* have a parent, the function calls itself and passes the current node’s parent to the function. In this way, the changes are propagated all the way up from the leaf nodes to the top-parent node.

These functions constitute the core functionality and data structure of the SumTree. In the next section, I’ll show how we can use it and confirm it is working as expected.

The code below shows how the SumTree is used and demonstrates its correct sampling characteristics:

input = [1, 4, 2, 3]
root_node, leaf_nodes = create_tree(input)


def demonstrate_sampling(root_node: Node):
    tree_total = root_node.value
    iterations = 1000000
    selected_vals = []
    for i in range(iterations):
        rand_val = np.random.uniform(0, tree_total)
        selected_val = retrieve(rand_val, root_node).value
        selected_vals.append(selected_val)
    return selected_vals


selected_vals = demonstrate_sampling(root_node)
# the below print statement should output ~4
print(f"Should be ~4: {sum([1 for x in selected_vals if x == 4]) / sum([1 for y in selected_vals if y == 1])}")

First, a simple *input* weight list is created which corresponds to the weights used in the example problem shown above. Next, the tree is created from these inputs using the *create_tree* function. As shown above, this function returns the top parent node (called the *root_node* here) and a list of the leaf nodes. Next, a function is created to demonstrate how the sampling works in practice. Note the only input argument is the top-parent / root node – this node contains within itself all the child nodes and associated connections. Within this function, a loop is entered into with an arbitrarily large number of iterations.

In each iteration, a random value is sampled from a uniform distribution with a range between 0 and *tree_total* – which is the same as the value of the top parent node or *root_node*. This sampled value is then passed to the *retrieve* function (along with *root_node*) which returns the appropriate leaf node, whose weight value is then extracted. This value is added to an accumulating list which is finally returned from the function.

The final print statement uses list comprehensions to count the number of times the leaf node with the weight value of 4 has been returned, compared to the number of times the leaf node with the weight value of 1 has been returned. The ratio should be ~4, and the readers can demonstrate that this is indeed the case by running the code. Finally, the second leaf node (with a current weight value of 4) is updated to have a value of 6 and the same process with associated ratio checks is performed:

update(leaf_nodes[1], 6)
selected_vals = demonstrate_sampling(root_node)
# the below print statement should output ~6
print(f"Should be ~6: {sum([1 for x in selected_vals if x == 6]) / sum([1 for y in selected_vals if y == 1])}")
# the below print statement should output ~2
print(f"Should be ~2: {sum([1 for x in selected_vals if x == 6]) / sum([1 for y in selected_vals if y == 3])}")

Again, the reader can check that these ratios come out at values that they should. This concludes the introduction to the SumTree data structure and associated algorithms. In future posts, this algorithm and code will be used in Prioritised Experience Replay (PER) based memory functions for Q-based reinforcement learning, so stay tuned for that.

Double Q learning was created to address two problems with vanilla deep Q learning. These are:

- Using the same network to both choose the best action and evaluate the quality of that action is a source of feedback / learning instability.
- The *max* function used in calculating the target Q value (see formula below), which the neural network is to learn, tends to bias the network towards high, noisy rewards. This again hampers learning and makes it more erratic

The problematic Bellman equation is shown below:

$$Q_{target} = r_{t+1} + \gamma \max_{a}Q(s_{t+1}, a;\theta_t)$$

The Double Q solution to the two problems above involves creating another *target* network, which is initially created with weights equal to the *primary* network. However, during training the primary network and the target network are allowed to “drift” apart. The primary network is trained as per usual, but the target network is not. Instead, the target network weights are either periodically (but not frequently) set equal to the primary network weights, or they are only gradually “blended” with the primary network weights in a weighted average fashion.

The benefit then comes from the fact that in Double Q learning, the Q value of the best action in the next state ($s_{t + 1}$) is extracted from the *target* network, not the primary network. The primary network is still used to select the best action, $a^*$, by taking an *argmax* of the outputs from the primary network, but the Q value for this action is evaluated from the *target* network. This can be observed in the formulation below:

$$a^* = \underset{a}{\mathrm{argmax}} \: Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

Notice the different weights involved in the formulas above – the best action, $a^*$, is calculated from the network with $\theta_t$ weights – this is the primary network. However, the $Q_{target}$ calculation uses the *target* network, with weights $\theta^-_t$, to estimate the Q value for this chosen action. This Double Q methodology decouples the choosing of an action from the evaluation of the Q value of such an action. This provides more stability to the learning – for more details and a demonstration of the superiority of Double Q over vanilla deep Q learning, see this post.
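The selection/evaluation split can be illustrated numerically. In the sketch below, the network outputs are stand-in numpy arrays rather than real network predictions (the numbers are made up for illustration):

```python
import numpy as np

# hypothetical Q outputs for a single next state s_{t+1}, one value per action
primary_q_next = np.array([1.0, 3.5, 2.0])   # Q(s_{t+1}, a; theta_t)
target_q_next = np.array([1.2, 2.8, 2.5])    # Q(s_{t+1}, a; theta_t^-)

reward = 1.0
gamma = 0.99

# Double Q: the *primary* network chooses the action...
a_star = np.argmax(primary_q_next)
# ...but the *target* network evaluates its Q value
q_target = reward + gamma * target_q_next[a_star]

# contrast with vanilla deep Q learning, where one network does both
q_target_vanilla = reward + gamma * np.max(primary_q_next)
```

Note how the vanilla target always takes the primary network's own maximum, while the Double Q target evaluates the chosen action with the independent target network, damping the upward bias.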

The Dueling Q architecture, discussed in detail in this post, is an improvement to the Double Q network. It uses the same methodology of a target and a primary network, with periodic updates or blending of the target network weights to the primary network weights. However, it builds two important concepts into the architecture of the network. These are the advantage and value functions:

- **Advantage function A(s, a):** the relative benefit of choosing a certain action in state *s* over the other possible actions in state *s*
- **Value function V(s):** the value of being in state *s*, *independent* of the relative benefits of the actions within that state

The Q function is the simple addition of these two functions:

$$Q(s, a) = V(s) + A(s, a)$$

The motivation of splitting these two functions explicitly in the architecture is that there can be inherently good or bad states for the agent to be in, regardless of the relative benefit of any actions within that state. For instance, in a certain state, all actions may lead to the agent “dying” in a game – this is an inherently bad state to be in, and there is no need to waste computational resources trying to determine the best action in this state. The converse can also be true. Ideally, this “splitting” into the advantage function and value function should be learnt implicitly during training. However, the Dueling Q architecture makes this split explicit, which acts to improve training. The Dueling Q architecture can be observed in the figure below:

It can be observed that in the Dueling Q architecture, there are common Convolutional Neural Network layers which perform image processing. The output from these layers is then flattened, and the network bifurcates into a Value function stream V(s) and an Advantage function stream A(s, a). The outputs of these separate streams are then aggregated in a special layer, before the network finally outputs Q values. The aggregation layer does not perform a simple addition of the Value and Advantage streams – this would result in problems of *identifiability* (for more details on this, see the original Dueling Q post). Instead, the following aggregation function is performed:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$

where $|\mathcal{A}|$ is the number of available actions. In this post, I’ll demonstrate how to use the Dueling Q architecture to train an agent in TensorFlow 2 to play Atari Space Invaders. In particular, I will concentrate on the extra considerations required to train the agent via an image stream from an Atari game. For further details, again, refer to the original Dueling Q post.
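The effect of the mean-subtraction in the aggregation can be checked with a few lines of numpy. The V and A values below are made-up numbers for illustration; the point is that any constant offset in the advantage stream cancels out, so the advantages cannot “absorb” part of the state value:

```python
import numpy as np

# made-up stream outputs for a single state with three actions
v = 5.0                               # value stream output V(s)
adv = np.array([2.0, -1.0, -1.0])     # advantage stream outputs A(s, a)

# the aggregation performed in the Dueling Q architecture
q = v + adv - adv.mean()

# a constant offset added to the advantage stream cancels out entirely,
# so the value stream alone must carry the state value
q_offset = v + (adv + 10.0) - (adv + 10.0).mean()
```

Here `q` and `q_offset` are identical, which is exactly what makes the V/A split identifiable.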

Training reinforcement learning agents on Atari environments is *hard* – it can be a very time consuming process as the environment complexity is high, especially when the agent needs to visually interpret objects direct from images. As such, each environment needs to be considered to determine legitimate ways of reducing the training burden and improving the performance. Three methods will be used in this post:

- Converting images to greyscale
- Reducing the image size
- Stacking frames

The first, relatively easy, step in reducing the computational training burden is to convert all the incoming Atari images from depth-3 RGB colour images to depth-1 greyscale images. This reduces the number of input channels to the first convolutional layer by a factor of 3. Another step which can be performed to reduce the size of the input layers is to resize the image inputs to make them smaller. There is obviously a limit to how far the image sizes can be reduced before learning performance is affected; however, in this case, a halving of the image size by rescaling is possible without affecting performance too much. The original image sizes from the Atari Space Invaders game are (210, 160, 3) – after converting to greyscale and resizing by half, the new image size is (105, 80, 1). Both of these operations are easy enough to implement in TensorFlow 2:

```python
def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image
```

The next step that is commonly performed when training agents on Atari games is the stacking of image frames, with all of these frames fed into the input CNN layers. The purpose of this is to allow the neural network to get some sense of the *direction* of the objects moving within the image. Consider a single, static image – examining such an image on its own gives no information about which direction any of the objects within it are travelling (or their respective speeds). Therefore, for each sample fed into the neural network, a stack of frames is presented to the input – this gives the neural network both temporal and spatial information to work with. The input dimensions to the network are not, then, of size (105, 80, 1) but rather (105, 80, NUM_FRAMES). In this case, we’ll use 3 frames to feed into the network i.e. NUM_FRAMES = 3. The specifics of how these stacked frames are stored, extracted and updated will be revealed as we step through the code in the next section. Additional steps can be taken to improve performance in complex Atari environments and similar cases. These include the skipping of frames and prioritised experience replay (PER). However, these have not been implemented in this example. A future post will discuss the benefits of PER and how to implement it.

The section below details the TensorFlow 2 implementation of training an agent on the Atari Space Invaders environment. In this post, comprehensive details of the Dueling Q architecture and training implementation will not be given – for a step by step discussion on these details, see my Dueling Q introductory post. However, detailed information will be given about the specific new steps required to train in the Atari environment. As stated at the beginning of the post, all code can be found on this site’s Github repository.

First we define the Double/Dueling Q model class with its structure:

```python
env = gym.make("SpaceInvaders-v0")
num_actions = env.action_space.n


class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                            kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                              kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1,
                                            kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv


primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables,
                primary_network.trainable_variables):
    t.assign(e)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss=tf.keras.losses.Huber())
```

In the code above, first the Space Invaders environment is created. After this, the DQModel class is defined as a keras.Model base class. In this model, you can observe that first a number of convolutional layers are created, then a flatten layer and dedicated fully connected layers to enact the value and advantage streams. This structure is then implemented in the model *call *function. After this model class has been defined, two versions of it are implemented corresponding to the primary_network and the target_network – as discussed above, both of these will be utilised in the Double Q component of the learning. The target_network weights are then set to be initially equal to the primary_network weights. Finally the primary_network is compiled for training using an Adam optimizer and a Huber loss function. As stated previously, for more details see this post.

Next we will look at the Memory class, which is to hold all the previous experiences of the agent. This class is a little more complicated in the Atari environment case, due to the necessity of dealing with stacked frames:

```python
class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._actions = np.zeros(max_memory, dtype=np.int32)
        self._rewards = np.zeros(max_memory, dtype=np.float32)
        self._frames = np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1],
                                 max_memory), dtype=np.float32)
        self._terminal = np.zeros(max_memory, dtype=bool)
        self._i = 0
```

In the class __init__ function, it can be observed that all the various memory buffers (for actions, rewards etc.) are initialized according to *max_memory* at the get-go. This is in opposition to a memory approach which involves appending to lists. This is performed so that it can be determined whether there will be a memory problem during training from the very beginning (as opposed to the code falling over after you’ve already been running it for 3 days!). It also increases the efficiency of the memory allocation process (as appending / growing memory dynamically is an inefficient process). You’ll also observe the creation of a counter variable, *self._i*. This is to record the present location of stored samples in the memory buffer, and will ensure that the memory is not overflowed. The next function within the class shows how samples are stored within the class:

```python
    def add_sample(self, frame, action, reward, terminal):
        self._actions[self._i] = action
        self._rewards[self._i] = reward
        self._frames[:, :, self._i] = frame[:, :, 0]
        self._terminal[self._i] = terminal
        if self._i % (self._max_memory - 1) == 0 and self._i != 0:
            self._i = BATCH_SIZE + NUM_FRAMES + 1
        else:
            self._i += 1
```

As will be shown shortly, for every step in the Atari environment, the current image frame, the action taken, the reward received and whether the state is terminal (i.e. the agent ran out of lives and the game ends) are stored in memory. Notice that nothing special as yet is being done with the stored frames – they are simply stored in order as the game progresses. The frame stacking process occurs during the sample extraction method, to be covered next. One thing to notice is that once *self._i* reaches *max_memory*, the index is reset back towards the beginning of the memory buffer (but offset by the batch size and the number of frames). This reset means that, once the memory buffer reaches its maximum size, it will begin to overwrite the oldest samples. The next method in the class governs how random sampling from the memory buffer occurs:

```python
    def sample(self):
        if self._i < BATCH_SIZE + NUM_FRAMES + 1:
            raise ValueError("Not enough memory to extract a batch")
        else:
            rand_idxs = np.random.randint(NUM_FRAMES + 1, self._i, size=BATCH_SIZE)
            states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0],
                               POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES), dtype=np.float32)
            next_states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0],
                                    POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES), dtype=np.float32)
            for i, idx in enumerate(rand_idxs):
                states[i] = self._frames[:, :, idx - 1 - NUM_FRAMES:idx - 1]
                next_states[i] = self._frames[:, :, idx - NUM_FRAMES:idx]
            return states, self._actions[rand_idxs], self._rewards[rand_idxs], \
                next_states, self._terminal[rand_idxs]
```

First, a simple check is performed to ensure there are enough samples in the memory to actually extract a batch. If so, a set of random indices *rand_idxs* is selected. These random integers are selected from a range with a lower bound of NUM_FRAMES + 1 and an upper bound of *self._i*. In other words, it is possible to select any indices from near the start of the memory buffer up to the current filled location of the buffer – however, because NUM_FRAMES of images prior to each selected index are extracted, indices less than NUM_FRAMES + 1 are not allowed. The number of random indices selected is equal to the batch size.

Next, some numpy arrays are initialised which will hold the current states and the next states – in this example, these are of size (32, 105, 80, 3), where 3 is the number of frames to be stacked (NUM_FRAMES). A loop is then entered into for each of the randomly selected memory indices. As can be observed, the *states* batch row is populated by the stored frames ranging from *idx – 1 – NUM_FRAMES* up to (but excluding) *idx – 1* – in other words, the 3 frames immediately preceding index *idx – 1*. Alternatively, the batch row for *next_states* holds the 3 frames up to and including index *idx – 1* (think of a window of 3 frames shifted along by 1 position). These variables *states* and *next_states* are then returned from this function, along with the corresponding actions, rewards and terminal flags. The terminal flags communicate whether the game finished during the randomly selected states. Finally, the memory class is instantiated with the memory size as the argument:

memory = Memory(200000)

The memory size should ideally be as large as possible, but considerations must be given to the amount of memory available on whatever computing platform is being used to run the training.
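A quick back-of-the-envelope calculation makes this trade-off concrete. With the pre-allocated float32 frame buffer defined above, the frame storage alone for 200,000 samples of (105, 80) greyscale images comes to roughly 6.7 GB:

```python
# approximate memory footprint of the pre-allocated frame buffer
max_memory = 200000
frame_bytes = 105 * 80 * 4                  # one float32 greyscale frame: 33,600 bytes
total_gb = max_memory * frame_bytes / 1e9
print(f"frame buffer: ~{total_gb:.1f} GB")  # prints "frame buffer: ~6.7 GB"
```

Storing the frames as uint8 and normalising only at sampling time would cut this by a factor of 4, at the cost of a little extra work per batch.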

The following two functions are standard functions to choose the actions and update the target network:

```python
def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                                POST_PROCESS_IMAGE_SIZE[1],
                                                                NUM_FRAMES)).numpy()))


def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables,
                    primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)
```

The *choose_action* function performs the *epsilon-greedy* action selection policy: a random action is selected if a random value falls below *eps*, otherwise the action with the highest Q value from the network is chosen. The *update_network* function slowly shifts the target network weights towards the primary network weights, in accordance with the Double Q learning methodology. The next function deals with the “state stack”, which is an array holding the last NUM_FRAMES frames of the episode:

```python
def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack
```

This function takes the existing state stack array and the newest state to be added. It then shuffles all the existing frames within the state stack “back” one position. In other words, the most recent state, in this case sitting in row 2 of the state stack, is shuffled back to row 1. The frame / state in row 1 is shuffled to row 0. Finally, the most recent state or frame is stored in the newly vacated row 2 of the state stack. The state stack is required so that it can be fed into the neural network to choose actions, and its updating can be observed in the main training loop, as will be reviewed shortly.

Next up is the training function:

```python
def train(primary_network, memory, target_network=None):
    states, actions, rewards, next_states, terminal = memory.sample()
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index
    # corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = terminal != True
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs],
                                                             prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    return loss
```

This *train* function is very similar to the train function reviewed in my first Dueling Q tutorial. Essentially, it first extracts batches of data from the memory buffer. Next, the Q values from the current states (*states*) and the following states (*next_states*) are extracted from the primary network – these values are returned in *prim_qt* and *prim_qtp1* respectively (where *qtp1* refers to the Q values for time *t + 1*). Next, the target Q values are initialized from the *prim_qt* values. After this, the *updates* variable is created – this holds the target Q values for the actions. These target values are the Q values which the network will “step towards” during the optimization step – hence the name “target” Q values.

The variable *valid_idxs* specifies those indices which don’t include terminal states – obviously for terminal states (states where the game ended), there are no future rewards to discount, so the target value for these states is simply the *rewards* value. For other states, which do have future rewards, these need to be discounted and added to the current reward for the target Q values. If no *target_network* is provided, it is assumed vanilla Q learning should be used to provide the discounted target Q values. Otherwise, Double Q learning is implemented.

According to that methodology, first the a* actions are selected – those actions with the highest Q values in the next state (t + 1). These actions are taken from the primary network, using the numpy argmax function. Next, the Q values of the next state (t + 1) are extracted from the *target network*. The *updates* value is then incremented, for valid indices, by adding the discounted future Q values from the target network for the actions a* selected from the primary network. Finally, the network is trained using the Keras *train_on_batch* function.

Now it is time to review the main training loop:

```python
num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH +
                                             f"/DuelingQSI_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = True
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape((POST_PROCESS_IMAGE_SIZE[0],
                                                                            POST_PROCESS_IMAGE_SIZE[1],
                                                                            NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        state_stack = process_state_stack(state_stack, next_state)
        # store in memory
        memory.add_sample(next_state, action, reward, done)
        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network if double_q else None)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss
        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                  MIN_EPSILON
        steps += 1
        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {tot_reward}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i)
            break
        cnt += 1
```

This training loop is very similar to the training loop in my Dueling Q tutorial, so for a detailed review, please see that post. The main differences relate to how the frame stacking is handled. First, you’ll notice at the start of the loop that the environment is reset and the first state / image is extracted. This state or image is pre-processed, then repeated NUM_FRAMES times and reshaped to create the first frame stack, of size (105, 80, 3) in this example. Another point to note is that a gif recording function has been created, which is called every GIF_RECORDING_FREQ episodes. This function simply outputs every frame to a gif so that the training progress can be monitored by observing actual gameplay. As such, there is a frame list which is filled whenever each GIF_RECORDING_FREQ episode comes around, and this frame list is passed to the gif recording function. Check out the code for this tutorial for more details. Finally, it can be observed that after every step, the state stack is processed by shuffling along each recorded frame / state in that stack.

The image below shows how the training progresses through each episode with respect to the total reward received for each episode:

As can be observed from the plot above, the reward steadily increases over 1500 episodes of game play. Note – if you wish to replicate this training on your own, you will need GPU processing support in order to reduce the training timeframes to a reasonable level. In this case, I utilised the Google Cloud Compute Engine and a single GPU. The gifs below show the progress of the agent in gameplay between episode 50 and episode 1450:

As can be observed, after 50 episodes the agent still moves around randomly and is quickly killed, achieving a score of only 60 points. However, after 1450 episodes, the agent can be seen to be playing the game much more effectively, even having learnt to destroy the occasional purple “master ship” flying overhead to gain extra points.

This post has demonstrated how to effectively train agents to operate in Atari environments such as Space Invaders. In particular it has demonstrated how to use the Dueling Q reinforcement learning algorithm to train the agent. A future post will demonstrate how to make the training even more efficient using the Prioritised Experience Replay (PER) approach.

As discussed in detail in this post, vanilla deep Q learning has some problems. These problems can be boiled down to two main issues:

- The bias problem: vanilla deep Q networks tend to overestimate rewards in noisy environments, leading to non-optimal training outcomes
- The moving target problem: because the same network is responsible for both the choosing of actions and the evaluation of actions, this leads to training instability

With regards to (1) – say we have a state with two possible actions, each giving noisy rewards. Action *a* returns a random reward based on a normal distribution with a mean of 2 and a standard deviation of 1 – *N(2, 1)*. Action *b* returns a random reward from a normal distribution of *N(1, 4)*. On average, action *a* is the optimal action to take in this state – however, because of the *max* operation in deep Q learning, action *b* will tend to be favoured, because its higher standard deviation occasionally produces large, noisy rewards.
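This effect is easy to reproduce in simulation. The sketch below (sample sizes and counts are arbitrary choices for illustration) estimates each action's value from a handful of noisy reward samples, as would happen early in training, and counts how often the noisy estimate of *b* beats that of *a*:

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 10000
n_samples = 3          # only a few reward samples per action, as early in training
b_chosen = 0
for _ in range(n_trials):
    q_a = rng.normal(2, 1, n_samples).mean()   # action a: rewards from N(2, 1)
    q_b = rng.normal(1, 4, n_samples).mean()   # action b: rewards from N(1, 4)
    if q_b > q_a:
        b_chosen += 1

# action a is better on average (mean 2 vs 1), yet the noisy
# comparison still picks b in a substantial fraction of trials
print(f"b chosen in {100 * b_chosen / n_trials:.0f}% of trials")
```

With so few samples, the high-variance action wins the comparison far more often than its true mean warrants, which is exactly the bias the max operator bakes into the Q target.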

For (2) – let’s consider another state, state 1, with three possible actions *a, b,* and *c.* Let’s say we know that *b* is the optimal action. However, when we first initialize the neural network, in state 1, action *a* tends to be chosen. When we’re training our network, the loss function will drive the weights of the network towards choosing action *b.* However, next time we are in state 1, the parameters of the network have changed to such a degree that now action *c* is chosen. Ideally, we would have liked the network to consistently choose action *a* in state 1 until it was gradually trained to choose action *b*. But now the goal posts have shifted, and we are trying to move the network from *c* to *b* instead of *a* to *b* – this gives rise to instability in training. This is the problem that arises when you have the same network both choosing actions and evaluating the worth of actions.

To overcome this problem, Double Q learning proposed the following way of determining the target Q value:

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, \underset{a}{\mathrm{argmax}} \: Q(s_{t+1}, a; \theta_t); \theta^-_t)$$

Here $\theta_t$ refers to the primary network parameters (weights) at time *t*, and $\theta^-_t$ refers to something called the *target network* parameters at time *t*. This *target network* is a kind of delayed copy of the primary network. As can be observed, the optimal action in state *t + 1* is chosen from the primary network ($\theta_t$), but the evaluation or estimate of the Q value of this action is determined from the *target network* ($\theta^-_t$).

This can be shown more clearly by the equations below:

$$a^* = \underset{a}{\mathrm{argmax}} \: Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

By doing this, two things occur. First, different networks are used to choose the actions and to evaluate them. This breaks the moving target problem mentioned earlier. Second, the primary network and the target network have essentially been trained on different samples from the memory bank of states and actions (the target network is “trained” on older samples than the primary network). Because of this, any bias due to environmental randomness should be “smoothed out”. As was shown in my previous post on Double Q learning, there is a significant improvement in using Double Q learning instead of vanilla deep Q learning. However, a further improvement can be made on the Double Q idea – the Dueling Q architecture, which will be covered in the next section.

The Dueling Q architecture trades on the idea that the evaluation of the Q function implicitly calculates two quantities:

- V(s) – the *value* of being in state *s*
- A(s, a) – the *advantage* of taking action *a* in state *s*

These values, along with the Q function, Q(s, a), are very important to understand, so we will do a deep dive into these concepts here. Let’s first examine the generalised formula for the value function V(s):

$$V^{\pi}(s) = \mathbb{E} \left[ \sum_{i=1}^T \gamma^{i-1}r_{i}\right]$$

The formula above means that the value function at state *s*, operating under a policy $\pi$, is the expected summation of future discounted rewards starting from state *s*. In other words, if an agent starts at *s*, it is the sum of all the rewards the agent collects operating under the policy $\pi$. The $\mathbb{E}$ is the expectation operator.

Let’s consider a basic example. Assume an agent is playing a game with a set number of turns. In the second-to-last turn, the agent is in state *s*. From this state, it has 3 possible actions, with rewards of 10, 50 and 100 respectively. Let’s say that the policy for this agent is simple random selection. Because this is the last set of actions and rewards in the game, due to the game finishing next turn, there are no discounted future rewards. The value for this state under the random action policy is:

$$V^{\pi}(s) = \mathbb{E} \left[\mathrm{random}(10, 50, 100)\right] = 53.333$$

Now, clearly this policy is not going to produce optimum outcomes. However, we know that, for the optimum policy, the value of this state would be:

$$V^*(s) = \max (10, 50, 100) = 100$$

If you recall from Q learning theory, the optimal action in this state is:

$$a^* = \underset{a}{\mathrm{argmax}} \: Q(s, a)$$

and the optimal Q value from taking this action in this state would be:

$$Q(s, a^*) = \max (10, 50, 100) = 100$$

Therefore, under the optimal (deterministic) policy we have:

$$Q(s, a^*) = V(s)$$

However, what if we aren’t operating under the optimal policy (yet)? Let’s return to the case where our policy is simple random action selection. In such a case, the Q function at state *s* could be described as (remember there are no future discounted rewards, and V(s) = 53.333):

$$Q(s, a) = V(s) + (-43.33, -3.33, 46.67) = (10, 50, 100)$$

The term (-43.33, -3.33, 46.67) under such an analysis is called the Advantage function A(s, a). The Advantage function expresses the relative benefits of the various actions possible in state *s*. The Q function can therefore be expressed as:

$$Q(s, a) = V(s) + A(s, a)$$

Under the optimum policy we have $A(s, a^*) = 0$, $V(s) = 100$ and therefore:

$$Q(s, a) = V(s) + A(s, a) = 100 + (-90, -50, 0) = (10, 50, 100)$$

Now the question becomes: why do we want to decompose the Q function in this way?
Because there is a difference between the value of a particular state *s* and the actions proceeding from that state. Consider a game where, from a given state *s*, all actions lead to the agent dying and ending the game. This is an inherently low-value state to be in, and who cares about the actions which one can take in such a state? It is pointless for the learning algorithm to waste training resources trying to find the best actions to take. In such a state, the Q values should be based solely on the value function V, and this state should simply be avoided. The converse case also holds – some states are just inherently valuable to be in, regardless of the effects of subsequent actions.
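The arithmetic of the worked example above (rewards of 10, 50 and 100, no discounting) can be reproduced in a few lines of numpy:

```python
import numpy as np

rewards = np.array([10.0, 50.0, 100.0])

# random-selection policy: V(s) is the expected (mean) reward
v_random = rewards.mean()              # 53.333...
a_random = rewards - v_random          # (-43.33, -3.33, 46.67)

# optimal (deterministic) policy: V(s) is the best achievable reward
v_opt = rewards.max()                  # 100.0
a_opt = rewards - v_opt                # (-90, -50, 0)

# in both cases, Q(s, a) = V(s) + A(s, a) recovers the original rewards
q_random = v_random + a_random
q_opt = v_opt + a_opt
```

Note that the two (V, A) pairs are different, yet both reproduce exactly the same Q values – a first glimpse of the identifiability problem discussed later.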

Consider these images taken from the original Dueling Q paper – showing the value and advantage components of the Q value in the Atari game Enduro:

In the Atari Enduro game, the goal of the agent is to pass as many cars as possible. “Running into” a car slows the agent’s car down and therefore reduces the number of cars which will be overtaken. In the images above, it can be observed that the value stream considers the road ahead and the score. However, the advantage stream does not “pay attention” to anything much when there are no cars visible. It only begins to register when there are cars close by and an action is required to avoid them. This is a good outcome, as when no cars are in view, the network should not be trying to determine which actions to take, as this is a waste of training resources. This is the benefit of splitting the value and advantage functions.

Now, you could argue that, because the Q function inherently contains both the value and advantage functions anyway, the neural network should learn to separate out these components regardless. Indeed, it may do. However, this comes at a cost. If the ML engineer already knows that it is important to try and separate these values, why not build them into the architecture of the network and save the learning algorithm the hassle? That is essentially what the Dueling Q network architecture does. Consider the image below showing the original architecture:

First, notice that the first part of the architecture is common, with CNN input filters and a common Flatten layer (for more on convolutional neural networks, see this tutorial). After the Flatten layer, the network bifurcates into separate densely connected layers. The first densely connected layer produces a single output corresponding to V(s). The second densely connected layer produces *n* outputs, where *n* is the number of available actions – and each of these outputs is the expression of the advantage function. These value and advantage functions are then aggregated in the aggregation layer to produce Q value estimations for each possible action in state *s*. These Q values can then be trained to approach the target Q values, generated via the Double Q mechanism i.e.:

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, \operatorname{argmax}_a Q(s_{t+1}, a; \theta_t); \theta^-_t)$$

The idea is that, through these separate value and advantage streams, the network will learn to produce accurate estimates of the values and advantages, improving learning performance.

What goes on in the aggregation layer? One might think we could just add the V(s) and A(s, a) values together like so:

$$Q(s, a) = V(s) + A(s, a)$$

However, there is an issue here, and it’s called the problem of identifiability. This problem in the current context can be stated as follows: given Q, there is no way to uniquely identify V or A. What does this mean? Say that the network is trying to learn some optimal Q value for action *a*. Given this Q value, can we uniquely learn a V(s) and A(s, a) value? Under the formulation above, the answer is no.

Let’s say the “true” value of being in state *s* is 50 i.e. V(s) = 50. Let’s also say the “true” advantage in state *s* for action *a* is 10. This will give a Q value, Q(s, a), of 60 for this state and action. However, we can also arrive at the same Q value for a learned V(s) of, say, 0 and an advantage function A(s, a) = 60. Or alternatively, a learned V(s) of -1000 and an advantage A(s, a) of 1060. In other words, there is no way to guarantee the “true” values of V(s) and A(s, a) are being learned separately and uniquely from each other. The commonly used solution to this problem is to instead perform the following aggregation function:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$

where $|\mathcal{A}|$ is the number of available actions. Here the advantage function value is normalized with respect to the mean of the advantage function values over all actions in state *s*.
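A small NumPy sketch makes the effect of this mean-subtraction concrete. Under naive addition, a constant shift in the advantages (with a matching opposite shift in V) leaves Q unchanged, so V and A cannot be separately identified; with the mean subtracted, any constant offset in the advantage stream is removed. The numbers below are invented for illustration:

```python
import numpy as np

def aggregate(v, adv):
    # Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')
    return v + adv - adv.mean()

v = 50.0
adv = np.array([10.0, -5.0, -5.0])

q_a = aggregate(v, adv)
# A constant offset added to every advantage no longer changes the Q values,
# so the advantage stream is pinned to zero mean and becomes identifiable
q_b = aggregate(v, adv + 1000.0)
```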

In TensorFlow 2.0, we can create a common “head” network, consisting of introductory layers which act to process the images or other environmental / state inputs. Then, two separate streams are created using densely connected layers which learn the value and advantage estimates, respectively. These are then combined in a special aggregation layer which calculates the equation above to finally arrive at Q values. Once the network architecture is specified in accordance with the above description, the training proceeds in the same fashion as Double Q learning. The agent actions can be selected either directly from the output of the advantage function, or from the output Q values. Because the Q values differ from the advantage values only by the addition of the V(s) value (which is independent of the actions), the argmax-based selection of the best action will be the same regardless of whether it is extracted from the advantage or the Q values of the network.
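The claim that the best action is the same whichever stream it is read from can be illustrated with made-up numbers (the values below are arbitrary, not actual network outputs):

```python
import numpy as np

v = 3.2                                   # hypothetical V(s) output
adv = np.array([0.5, -1.1, 2.0, 0.1])     # hypothetical A(s, a) outputs

q = v + adv - adv.mean()                  # the Dueling Q aggregation

# V(s) is constant across actions, and subtracting the advantage mean
# shifts all entries equally, so the argmax is unchanged
best_from_adv = np.argmax(adv)
best_from_q = np.argmax(q)
```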

In the next section, the implementation of a Dueling Q network in TensorFlow 2.0 will be demonstrated.

In this section we will be building a Dueling Q network in TensorFlow 2. However, the code will be written so that both Double Q and Dueling Q networks can be constructed with the simple change of a Boolean identifier. The environment that the agent will train in is OpenAI Gym’s CartPole environment. In this environment, the agent must learn to move the cart platform back and forth in order to stop a pole falling too far below the vertical axis. While Dueling Q was originally designed for processing images, with its multiple CNN layers at the beginning of the model, in this example we will be replacing the CNN layers with simple densely connected layers. Because training reinforcement learning agents using images only (i.e. Atari RL environments) takes a long time, in this introductory post only a simple environment is used for training the model. Future posts will detail how to efficiently train in Atari RL environments. All the code for this tutorial can be found on this site’s Github repo.

First of all, we declare some constants that will be used in the model, and initiate the CartPole environment:

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
EPSILON_MIN_ITER = 5000
DELAY_TRAINING = 300
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

The MAX_EPSILON and MIN_EPSILON variables define the maximum and minimum values of the epsilon-greedy variable which will determine how often random actions are chosen. Over the course of the training, the epsilon-greedy parameter will decay from MAX_EPSILON gradually to MIN_EPSILON. The EPSILON_MIN_ITER value specifies how many training steps it will take before the MIN_EPSILON value is obtained. The DELAY_TRAINING constant specifies how many iterations should occur, with the memory buffer being filled, before training of the network is undertaken. The GAMMA value is the future reward discount value used in the Q-target equation, and TAU is the merging rate of the weight values between the primary network and the target network as per the Double Q learning algorithm. Finally, RANDOM_REWARD_STD is the standard deviation of the rewards that introduces some stochastic behaviour into the otherwise deterministic CartPole environment.
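The decay schedule described above can be sketched as a standalone function (this mirrors the logic used later in the training loop; the constants are redefined here so the snippet is self-contained):

```python
# Constants mirror those declared earlier in the post
MAX_EPSILON = 1.0
MIN_EPSILON = 0.01
EPSILON_MIN_ITER = 5000
DELAY_TRAINING = 300

def linear_epsilon(steps):
    """Linearly decay epsilon from MAX_EPSILON to MIN_EPSILON over
    EPSILON_MIN_ITER steps, after a DELAY_TRAINING warm-up period."""
    if steps <= DELAY_TRAINING:
        return MAX_EPSILON
    frac = min((steps - DELAY_TRAINING) / EPSILON_MIN_ITER, 1.0)
    return MAX_EPSILON - frac * (MAX_EPSILON - MIN_EPSILON)
```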

After the definition of all these constants, the CartPole environment is created and the state size and number of actions are defined.

The next step in the code is to create a Keras model inherited class which defines the Double or Dueling Q network:

class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.dense1 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.dense2 = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                            kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                              kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.dense1(input)
        x = self.dense2(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

Let’s go through the above line by line. First, a number of parameters are passed to this model as part of its initialization – these include the size of the hidden layers of the advantage and value streams, the number of actions in the environment and finally a Boolean variable, *dueling*, to specify whether the network should be a standard Double Q network or a Dueling Q network. The first two model layers defined are simple Keras densely connected layers, *dense1* and *dense2*. These layers have ReLU activations and use the He normal weight initialization. The next two layers defined, *adv_dense* and *adv_out*, pertain to the advantage stream of the network, provided we are discussing a Dueling Q network architecture. If in fact the network is to be a Double Q network (i.e. dueling == False), then these names are a bit misleading – they are simply a third densely connected layer followed by the output Q layer (*adv_out*). However, keeping with the Dueling Q terminology, the first dense layer associated with the advantage stream is simply another standard dense layer of size = hidden_size. The final layer in this stream, *adv_out*, is a dense layer with only *num_actions* outputs – each of these outputs will learn to estimate the advantage of one of the actions in the given state (A(s, a)).

If the network is specified to be a Dueling Q network (i.e. dueling == True), then the value stream is also created. Again, a standard densely connected layer of size = hidden_size is created (*v_dense*). Then a final, single node dense layer is created to output the single value estimation (V(s)) for the given state. These layers specify the advantage and value streams respectively. Now the aggregation layer is to be created, using two Keras layers – a Lambda layer and an Add layer. The Lambda layer allows the developer to specify some user-defined operation to perform on the inputs to the layer. In this case, we want the layer to calculate the following:

$$A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$

This is calculated easily by using the *lambda x: x - tf.reduce_mean(x)* expression in the Lambda layer. Finally, we need a simple Keras Add layer to add this mean-normalized advantage function to the value estimation.

This completes the explanation of the layer definitions in the model. The *call* method in this model definition then applies these various layers to the state inputs of the model. The following two lines execute the Dueling Q aggregation function:

norm_adv = self.lambda_layer(adv)
combined = self.combine([v, norm_adv])

Note that first the mean-normalizing lambda function is applied to the output from the advantage stream. This normalized advantage is then added to the value stream output to produce the final Q values (*combined*). Now that the model class has been defined, it is time to instantiate two models – one for the primary network and the other for the target network:

primary_network = DQModel(30, num_actions, True)
target_network = DQModel(30, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

After the *primary_network* and *target_network* have been created, only the *primary_network* is compiled, as only the primary network is actually trained using the optimization function. As per Double Q learning, the target network is instead moved slowly “towards” the primary network by the gradual merging of weight values. Initially, however, the target network trainable weights are set to be equal to the primary network trainable variables, using the TensorFlow assign function.

The next function to discuss is the target network updating which is performed during training. In Double Q network training, there are two options for transitioning the target network weights towards the primary network weights. The first is to perform a wholesale copy of the weights every N training steps. Alternatively, the weights can be moved towards the primary network gradually every training iteration as follows:

def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

As can be observed, the new target weight variables are a weighted average between the current weight values and the primary network weights – with the weighting factor equal to TAU. The next code snippet is the definition of the memory class:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(500000)

This class takes tuples of (state, action, reward, next state) values and appends them to a memory list, which is randomly sampled from when required during training. The next function defines the epsilon-greedy action selection policy:

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

If a random number sampled from the interval between 0 and 1 falls below the current epsilon value, a random action is selected. Otherwise, the current state is passed to the primary model, from which the Q values for each action are returned. The action with the highest Q value, selected by the NumPy argmax function, is returned.

The next function is the *train* function, where the training of the primary network takes place:

def train(primary_network, memory, target_network):
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size) if val[3] is None else val[3])
                            for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update
    # one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    # extract the best action from the next state
    prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
    # get all the q values for the next state
    q_from_target = target_network(next_states)
    # add the discounted estimated reward from the selected action (prim_action_tp1)
    updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs],
                                                        prim_action_tp1[valid_idxs]]
    # update the q target to train towards
    target_q[batch_idxs, actions] = updates
    # run a training batch
    loss = primary_network.train_on_batch(states, target_q)
    return loss

For a more detailed explanation of this function, see my Double Q tutorial. However, the basic operations that are performed are expressed in the following formulas:

$$a^* = \operatorname{argmax}_a Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

The best action from the next state, $a^*$, is selected from the primary network (weights = $\theta_t$). However, the Q value for this action in the next state ($s_{t+1}$) is extracted from the target network (weights = $\theta^-_t$). A Keras train_on_batch operation is performed by passing a batch of states and corresponding target Q values, and the loss is finally returned from this function.
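The select-with-primary, evaluate-with-target split can be traced with a single hypothetical transition (all numbers below are invented for illustration):

```python
import numpy as np

q_primary_next = np.array([1.0, 3.0, 2.0])  # Q(s_{t+1}, a) from the primary network
q_target_next = np.array([0.8, 2.5, 2.2])   # Q(s_{t+1}, a) from the target network
reward, gamma = 1.0, 0.95

# The best next-state action is chosen by the primary network...
a_star = np.argmax(q_primary_next)           # action 1

# ...but its value is read from the target network
q_target = reward + gamma * q_target_next[a_star]   # 1.0 + 0.95 * 2.5 = 3.375
```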

The main training loop which trains our Dueling Q network is shown below:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH +
                                             f"/DuelingQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
steps = 0
for i in range(num_episodes):
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    state = env.reset()
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, _, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        tot_reward += reward
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))
        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss
        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                  MIN_EPSILON
        steps += 1
        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', cnt, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            break
        state = next_state
        cnt += 1

Again, this training loop has been explained in detail in the Double Q tutorial. However, some salient points are worth highlighting. First, Double and Dueling Q networks are superior to vanilla Deep Q networks especially in the cases where there is some stochastic component to the environment. As the CartPole environment is deterministic, some stochasticity is added in the reward. Normally, every time step in the episode results in a reward of 1 i.e. the CartPole has survived another time step – good job. However, in this case I’ve added a reward which is sampled from a normal distribution with a mean of 1.0 but a standard deviation of RANDOM_REWARD_STD. This adds the requisite uncertainty which makes Double and Dueling Q networks clearly superior to Deep Q networks – see my Double Q tutorial for a demonstration of this.

Another point to highlight is that training of the primary (and, by extension, the target) network does not commence until DELAY_TRAINING steps have been exceeded. Likewise, the epsilon value for the epsilon-greedy action selection policy doesn’t decay until these DELAY_TRAINING steps have been exceeded.

A comparison of the training progress with respect to the deterministic reward of the agent in the CartPole environment under Double Q and Dueling Q architectures can be observed in the figure below, with the x-axis being the number of episodes:

As can be observed, there is a slightly higher performance of the Double Q network with respect to the Dueling Q network. However, the performance difference is fairly marginal, and may be within the variation arising from the random weight initialization of the networks. There is also the issue of the Dueling Q network being slightly more complicated due to the additional value stream. As such, on a fairly simple environment like the CartPole environment, the benefits of Dueling Q over Double Q may not be realized. However, in more complex environments like Atari environments, it is likely that the Dueling Q architecture will be superior to Double Q (this is what the original Dueling Q paper has shown). Future posts will demonstrate the Dueling Q architecture in Atari environments.

The vanishing gradient problem was an initial barrier to making neural networks deeper and more powerful. However, as explained in this post, the problem has now largely been solved through the use of ReLU activations and batch normalization. Given this is true, and given enough computational power and data, we should be able to stack many CNN layers and dramatically increase classification accuracy, right? Well – to a degree. An early architecture, called the VGG-19 architecture, had 19 layers. However, this is a long way off the 152 layers of the version of ResNet that won the ILSVRC 2015 image classification task. The reason deeper networks were not successful prior to the ResNet architecture was due to something called the *degradation* problem. Note, this is *not* the vanishing gradient problem, but something else. It was observed that making the network deeper led to higher classification errors. One might think this is due to overfitting of the data – but not so fast, the degradation problem leads to higher *training *errors too! Consider the diagrams below from the original ResNet paper:

Note that the 56-layer network has higher test *and training* errors. Theoretically, this doesn’t make much sense. Let’s say the 20-layer network learns some mapping *H(x)* that gives a training error of 10%. If another 36 layers are added, we would expect that the error would *at least* not be any worse than 10%. Why? Well, the 36 extra layers, *at worst*, could just learn identity functions. In other words, the extra 36 layers could just learn to *pass through* the output from the first 20 layers of the network. This would give the same error of 10%. This doesn’t seem to happen though. It appears neural networks aren’t great at learning the identity function in deep architectures. Not only do they fail to learn the identity function (and hence *pass through* the 20-layer error rate), they *make things worse*. Beyond a certain number of layers, they begin to degrade the performance of the network compared to shallower implementations. Here is where the ResNet architecture comes in.

The ResNet solution relies on making the identity function an explicit *option* in the architecture, rather than relying on the network itself to learn the identity function where appropriate. It involves building networks out of the following CNN blocks:

In the diagram above, the input tensor *x* enters the building block. This input then splits. On one path, the input is processed by two stacked convolutional layers (called a “weight layer” in the above). This path is the “standard” CNN processing part of the building block. The ResNet innovation is the “identity” path. Here, the input *x* is simply added to the output of the CNN component of the building block, *F(x)*. The output from the block is then *F(x) + x* with a final ReLU activation applied at the end. This identity path in the ResNet building block allows the neural network to more easily *pass through* any abstractions learnt in previous layers. Alternatively, it can more easily build *incremental* abstractions on top of the abstractions learnt in the previous layers. What do I mean by this? The diagram below may help:

Generally speaking, as CNN layers are added to a network, the network during training will learn lower level abstractions in the early layers (i.e. lines, colours, corners, basic shapes etc.) and higher level abstractions in the later layers (groups of geometries, objects etc.). Let’s say that, when trying to classify an aircraft in an image, there are some mid-level abstractions which reliably signal that an aircraft is present – say, the shape of a jet engine near a wing (this is just an example). These abstractions might be able to be learnt in, say, 10 layers.

However, if we add an additional 20 or more layers after these first 10 layers, these reliable signals may get degraded / obfuscated. The ResNet architecture gives the network a more explicit chance of muting further CNN abstractions on some filters by driving *F(x)* to zero, with the output of the block defaulting to its input *x*. Not only that, the ResNet architecture allows each block to “tinker” more easily with the input. This is because the block only has to learn the incremental difference between the previous layer abstraction and the optimal output *H(x)*. In other words, it has to learn *F(x) = H(x) - x*. This is a residual expression, hence the name *Res*Net. This, theoretically at least, should be easier to learn than the full expression *H(x)*.

A (somewhat tortured) analogy might assist here. Say you are trying to draw the picture of a tree. Someone hands you a picture of a pencil outline of the main structure of the tree – the trunk, large branches, smaller branches etc. Now say you are somewhat proud, and you don’t want too much help in drawing the picture. So, you rub out parts of the pencil outline of the tree that you were handed. You then proceed to add some detail to the picture, but you have to redraw parts that you already rubbed out. This is kind of like the case of a standard non-ResNet network. Because layers seem to struggle to reproduce an identity function, at each subsequent layer they essentially erase or degrade some of the previous level abstractions and these need to be re-estimated (at least to an extent).

Alternatively, you, the artist, might not be too proud, and you happily accept the pencil outline that you received. It is much easier to then add new details to what you have already been given. This is like what the ResNet blocks do – they take what they are given, i.e. *x*, and just make tweaks to it by adding *F(x)*. This analogy isn’t perfect, but it should give you an idea of what is going on here, and how the ResNet blocks help the learning along.

A full 34-layer version of ResNet is (partially) illustrated below (from the original paper):

The diagram above shows roughly the first half of the ResNet 34-layer architecture, along with the equivalent layers of the VGG-19 architecture and a “plain” version of the ResNet architecture. The “plain” version has the same CNN layers, but lacks the identity path previously presented in the ResNet building block. These identity paths can be seen looping around every second CNN layer on the right hand side of the ResNet (“residual”) architecture.

In the next section, I’m going to show you how to build a ResNet architecture in TensorFlow 2/Keras. In the example, we’ll compare both the “plain” and “residual” networks on the CIFAR-10 classification task. Note that for computational ease, I’ll only include 10 ResNet blocks.

As discussed previously, the code for this example can be found on this site’s Github repository. Importing the CIFAR-10 dataset can be performed easily by using the Keras datasets API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import datetime as dt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

We then perform some pre-processing of the training and test data. This pre-processing includes image renormalization (converting the data so it resides in the range [0, 1]) and centrally cropping the image to 75% of its normal extents. Data augmentation is also performed by randomly flipping the image about the centre axis. This is performed using the TensorFlow Dataset API – more details on the code below can be found in this post and my book.

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
train_dataset = train_dataset.repeat()

valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
valid_dataset = valid_dataset.repeat()

In this example, to build the network, we’re going to use the Keras Functional API, in the TensorFlow 2 context. Here is what the ResNet model definition looks like:

inputs = keras.Input(shape=(24, 24, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(3)(x)

num_res_net_blocks = 10
for i in range(num_res_net_blocks):
    x = res_net_block(x, 64, 3)

x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation='softmax')(x)
res_net_model = keras.Model(inputs, outputs)

First, we specify the input dimensions to Keras. The raw CIFAR-10 images have a size of (32, 32, 3) – but because we are performing central cropping of 75%, the post-processed images are of size (24, 24, 3). Next, we create 2 standard CNN layers, with 32 and 64 filters respectively (for more on convolutional layers, see this post and my book). The filter window sizes are 3 x 3, in line with the original ResNet architectures. Next some max pooling is performed and then it is time to produce some ResNet building blocks. In this case, 10 ResNet blocks are created by calling the res_net_block() function:

def res_net_block(input_data, filters, conv_size):
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, conv_size, activation=None, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, input_data])
    x = layers.Activation('relu')(x)
    return x

The first few lines of this function are standard CNN layers with Batch Normalization, except the 2nd layer does not have an activation function (this is because one will be applied after the residual addition part of the block). After these two layers, the residual addition part, where the input data is added to the CNN output (*F(x)*), is executed. Here we can make use of the Keras Add layer, which simply adds two tensors together. Finally, a ReLU activation is applied to the result of this addition and the outcome is returned.

After the ResNet block loop is finished, some final layers are added. First, a final CNN layer is added, followed by a Global Average Pooling (GAP) layer (for more on GAP layers, see here). Finally, we have a couple of dense classification layers with a dropout layer in between. This model was trained over 30 epochs and then an alternative “plain” model was also created. This was created by taking the same architecture but replacing the *res_net_block* function with the following function:

def non_res_block(input_data, filters, conv_size):
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    return x

Note that this function is simply two standard CNN layers, with no residual components included. The training code is as follows:

callbacks = [
    # Write TensorBoard logs to `./logs` directory
    keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")),
                                write_images=True),
]

res_net_model.compile(optimizer=keras.optimizers.Adam(),
                      loss='sparse_categorical_crossentropy',
                      metrics=['acc'])

res_net_model.fit(train_dataset, epochs=30, steps_per_epoch=195,
                  validation_data=valid_dataset, validation_steps=3,
                  callbacks=callbacks)

The accuracy results of the training of these two models can be observed below:

As can be observed there is around a 5-6% improvement in the training accuracy from a ResNet architecture compared to the “plain” non-ResNet architecture. I have run this comparison a number of times and the 5-6% gap is consistent across the runs. These results illustrate the power of the ResNet idea, even for a relatively shallow 10 layer ResNet architecture. As demonstrated in the original paper, this effect will be more pronounced in deeper networks. Note that this network is not very well optimized, and the accuracy could be improved by running for more iterations. However, it is enough to show the benefits of the ResNet architecture. In future posts, I’ll demonstrate other ResNet-based architectures which can achieve even better results.