# Policy Gradient Reinforcement Learning in TensorFlow 2

In a series of recent posts, I have been reviewing the various Q based methods of deep reinforcement learning (see here, here, here, here and so on). Deep Q based reinforcement learning operates by training a neural network to learn the Q value for each action of an agent which resides in a certain state s of the environment. The policy which guides the actions of the agent in this paradigm operates by a random selection of actions at the beginning of training (the epsilon greedy method), but then the agent will select actions based on the highest Q value predicted in each state s. The Q value is simply an estimation of future rewards which will result from taking action a. An alternative to the deep Q based reinforcement learning is to forget about the Q value and instead have the neural network estimate the optimal policy directly. Reinforcement learning methods based on this idea are often called Policy Gradient methods.

This post will review the REINFORCE or Monte-Carlo version of the Policy Gradient methodology. This methodology will be used in the Open AI gym Cartpole environment. All code used and explained in this post can be found on this site’s Github repository.

## Policy Gradients and their theoretical foundation

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[\sum_{t=0}^{T-1} \gamma^t r_t \right]$$

The function above means that we are attempting to find a policy ($\pi$) with parameters ($\theta$) which maximises the expected value of the sum of the discounted rewards of an agent in an environment. Therefore, we need to find a way of varying the parameters of the policy $\theta$ such that the expected value of the discounted rewards are maximised. In the deep reinforcement learning case, the parameters $\theta$ are the parameters of the neural network.

Note the difference to the deep Q learning case – in deep Q based learning, the parameters we are trying to find are those that minimise the difference between the actual Q values (drawn from experiences) and the Q values predicted by the network. However, in Policy Gradient methods, the neural network directly determines the actions of the agent – usually by using a softmax output and sampling from this.

Also note that, because environments are usually non-deterministic, under any given policy ($\pi_\theta$) we are not always going to get the same reward. Rather, we are going to be sampling from some probability function as the agent operates in the environment, and therefore we are trying to maximise the expected sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.

Ok, so we want to learn the optimal $\theta$. The way we generally learn parameters in deep learning is by performing some sort of gradient based search of $\theta$. So we want to iteratively execute the following:

$$\theta \leftarrow \theta + \alpha \nabla J(\theta)$$

So the question is, how do we find $\nabla J(\theta)$?

First, let’s make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in x multiplied by the probability of x, like so:

$$\mathbb{E}[f(x)] = \sum_x P(x)f(x)$$

Ok, so what does the cashing out of the expectation in $J(\theta)$ look like? First, we have to define the function which produces the rewards, i.e. the rewards equivalent of $f(x)$ above. Let’s call this $R(\tau)$ (where
$R(\tau) = \sum_{t=0}^{T-1}r_t$, ignoring discounting for the moment).  The value $\tau$ is the trajectory of the agent “moving” through the environment. It can be defined as:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$

The trajectory, as can be seen, is the progress of the agent through an episode of a game of length T. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(\tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(\tau)$ – but what does this actually look like in a reinforcement learning environment?

It consists of the two components – the probabilistic policy function which yields an action $a_t$ from states $s_t$ with a certain probability, and a probability that state $s_{t+1}$ will result from taking action $a_t$ from state $s_t$. The latter probabilistic component is uncertain due to the random nature of many environments. These two components operating together will “roll out” the trajectory of the agent $\tau$.

$P(\tau)$ looks like:

$$P(\tau) = \prod_{t=0}^{T-1} P_{\pi_{\theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)$$

If we take the first step, starting in state $s_0$ – our neural network will produce a softmax output with each action assigned a certain probability. The action is then selected by weighted random sampling subject to these probabilities – therefore, we have a probability of action $a_0$ being selected according to $P_{\pi_{\theta}}(a_t|s_t)$. This probability is determined by the policy $\pi$ which in turn is parameterised according to $\theta$ (i.e. a neural network with weights $\theta$.  The next term will be $P(s_1|s_0,a_0)$ which expresses any non-determinism in the environment.

(Note: the vertical line in the probability functions above are conditional probabilities. $P_{\pi_{\theta}}(a_t|s_t)$ refers to the probability of action $a_t$ being selected, given the agent is in state $s_t$).

These probabilities are multiplied out over all the steps in the episode of length T to produce the trajectory $\tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(\tau)$ and $R(\tau)$ into the original expectation and take the derivative to get to $\nabla J(\theta)$ which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two.

First, let’s take the log derivative of $P(\tau)$ with respect to $\theta$ i.e. $\nabla_\theta$ and work out what we get:

$$\nabla_\theta \log P(\tau) = \nabla \log \left(\prod_{t=0}^{T-1} P_{\pi_{\theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)\right)$$
$$=\nabla_\theta \left[\sum_{t=0}^{T-1} (\log P_{\pi_{\theta}}(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)) \right]$$
$$=\nabla_\theta \sum_{t=0}^{T-1}\log P_{\pi_{\theta}}(a_t|s_t)$$

The reason we are taking the log will be made clear shortly. As can be observed, when the log is taken of the multiplicative operator ($\prod$) this is converted to a summation (as multiplying terms within a log function is equivalent to adding them separately). In the final line, it can be seen that taking the derivative with respect to the parameters ($\theta$) removes the dynamics of the environment ($\log P(s_{t+1}|s_t,a_t))$) as these are independent of the neural network parameters / $\theta$.

Let’s go back to our original expectation function, substituting in our new trajectory based functions, and apply the derivative (again ignoring discounting for simplicity):

$$J(\theta) = \mathbb{E}[R(\tau)]$$
$$= \smallint P(\tau) R(\tau)$$
$$\nabla_\theta J(\theta) = \nabla_\theta \smallint P(\tau) R(\tau)$$

So far so good. Now, we are going to utilise the following rule which is sometimes called the “log-derivative” trick:

$$\frac{\nabla_\theta p(X,\theta)}{p(X, \theta)} = \nabla_\theta \log p(X,\theta)$$

We can then apply the $\nabla_{\theta}$ operator within the integral, and cajole our equation so that we get the $\frac{\nabla_{\theta} P{\tau}}{P(\tau)}$ expression like so:

$$\nabla_\theta J(\theta)=\int P(\tau) \frac{\nabla_\theta P(\tau)}{P(\tau)} R(\tau)$$

Then, using the log-derivative trick and applying the definition of expectation, we arrive at:

$$\nabla_\theta J(\theta)=\mathbb{E}\left[R(\tau) \nabla_\theta logP(\tau)\right]$$

We can them substitute our previous derivation of $\nabla_{\theta} log P(\tau)$ into the above to arrive at:

$$\nabla_\theta J(\theta) =\mathbb{E}\left[R(\tau) \nabla_\theta \sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right]$$

This is now close to the point of being something we can work with in our learning algorithm. Let’s take it one step further by recognising that, during our learning process, we are randomly sampling trajectories from the environment, and hoping to make informed training steps. Therefore, we can recognise that, to maximise the expectation above, we need to maximise it with respect to its argument i.e. we maximise:

$$\nabla_\theta J(\theta) \sim R(\tau) \nabla_\theta \sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)$$

Recall that $R(\tau)$ is equal to $R(\tau) = \sum_{t=0}^{T-1}r_t$ (ignoring discounting). Therefore, we have two summations that need to be multiplied out, element by element. It turns out that after doing this, we arrive at an expression like so:

$$\nabla_\theta J(\theta) \sim \left(\sum_{t=0}^{T-1} log P_{\pi_{\theta}}(a_t|s_t)\right)\left(\sum_{t’= t + 1}^{T} \gamma^{t’-t-1} r_{t’} \right)$$

As can be observed, there are two main components that need to be multiplied. However, one should note the differences in the bounds of the summation terms in the equation above – these will be explained in the next section.

The way we compute the gradient as expressed above in the REINFORCE method of the Policy Gradient algorithm involves sampling trajectories through the environment to estimate the expectation, as discussed previously. This REINFORCE method is therefore a kind of Monte-Carlo algorithm. Let’s consider this a bit more concretely.

Let’s say we initialise the agent and let it play a trajectory $\tau$ through the environment. The actions of the agent will be selected by performing weighted sampling from the softmax output of the neural network – in other words, we’ll be sampling the action according to $P_{\pi_{\theta}}(a_t|s_t)$. At each step in the trajectory, we can easily calculate $log P_{\pi_{\theta}}(a_t|s_t)$ by simply taking the log of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial states $s_0$ as input, and it would produce a vector of actions $a_0$ with pseudo-probabilities generated by the softmax operation in the final layer.

What about the second part of the $\nabla_\theta J(\theta)$ equation – $\sum_{t’= t + 1}^{T} \gamma^{t’-t-1} r_{t’}$? We can see that the summation term starts at $t’ = t + 1 = 1$. The summation then goes from t=1 to the total length of the trajectory T – in other words, from t=1 to the total length of the episode. Let’s say the episode length was 4 states long – this term would then look like $\gamma^0 r_1 + \gamma^1 r_2 + \gamma^2 r_3$, where $\gamma$ is the discounting factor and is < 1.

Straight-forward enough. However, you may have realised that, in order to calculate the gradient $\nabla_\theta J(\theta)$ at the first step in the trajectory/episode, we need to know the reward values of all subsequent states in the episode. Therefore, in order to execute this method of learning, we can only take gradient learning steps after the full episode has been played to completion. Only after the episode is complete can we perform the training step.

We are almost ready to move onto the code part of this tutorial. However, this is a good place for a quick discussion about how we would actually implement the calculations $\nabla_\theta J(\theta)$ equation in TensorFlow 2 / Keras. It turns out we can just use the standard cross entropy loss function to execute these calculations. Recall that cross entropy is defined as (for a deeper explanation of entropy, cross entropy, information and KL divergence, see this post):

$$CE = -\sum p(x) log(q(x))$$

Which is just the summation between one function $p(x)$ multiplied by the log of another function $q(x)$ over the possible values of the argument x. If we look at the source code of the Keras implementation of cross-entropy, we can see the following:

Keras output of cross-entropy loss function

The output tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of output is calculated in the above. The target value, for our purposes, can be all the discounted rewards calculated at each step in the trajectory, and will be of size (num_steps_in_episode, 1). The summation of the multiplication of these terms is then calculated (reduce_sum). Gradient based training in TensorFlow 2 is generally a minimisation of the loss function, however, we want to maximise the calculation as discussed above. The good thing is, the sign of cross entropy calculation shown above is inverted – so we are good to go.

To call this training step utilising Keras, all we have to do is execute something like the following:

network.train_on_batch(states, discounted_rewards)

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through network, apply the softmax function, and this will become the output variable in the Keras source code snippet above. Likewise, discounted_rewards is the same as target in the source code snippet above.

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

## Policy Gradient reinforcement learning in TensorFlow 2 and Keras

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{\pi_{\theta}}(a_t|s_t)$ with the state as the input:

GAMMA = 0.95
env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n
network = keras.Sequential([
keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy',optimizer=keras.optimizers.Adam())

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities to approximate $P_{\pi_{\theta}}(a_t|s_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser.

The next part of the code chooses the action from the output of the model:

def get_action(network, state, num_actions):
softmax_out = network(state.reshape((1, -1)))
selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
return selected_action

As can be seen, first the softmax output is extracted from the network by inputing the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values.

The next function is the main function involved in executing the training step:

def update_network(network, rewards, states, actions, num_actions):
reward_sum = 0
discounted_rewards = []
for reward in rewards[::-1]:  # reverse buffer r
reward_sum = reward + GAMMA * reward_sum
discounted_rewards.append(reward_sum)
discounted_rewards.reverse()
discounted_rewards = np.array(discounted_rewards)
# standardise the rewards
discounted_rewards -= np.mean(discounted_rewards)
discounted_rewards /= np.std(discounted_rewards)
states = np.vstack(states)
loss = network.train_on_batch(states, discounted_rewards)
return loss

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from t + 1 to T according to $\sum_{t’= t + 1}^{T} \gamma^{t’-t-1} r_{t’}$. The input argument rewards is a list of all the rewards achieved at each step in the episode. The rewards[::-1] operation reverses the order of the rewards list, so the first run through the for loop will deal with last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the for loop is executed. Let’s say that the episode length is equal to 4 – $r_3$ will refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + \gamma r_3$, $r_1 + \gamma r_2 + \gamma^2 r_3$, $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3$]

This list is in reverse to the order of the actual state value list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the for loop reverses the list (discounted_rewards.reverse()).

Next, the list is converted into a numpy array, and the rewards are normalised to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras train_on_batch function, which was detailed earlier.

The next part of the code is the main episode and training loop:

num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
for episode in range(num_episodes):
state = env.reset()
rewards = []
states = []
actions = []
while True:
action = get_action(network, state, num_actions)
new_state, reward, done, _ = env.step(action)
states.append(state)
rewards.append(reward)
actions.append(action)
if done:
loss = update_network(network, rewards, states, actions, num_actions)
tot_reward = sum(rewards)
print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
with train_writer.as_default():
tf.summary.scalar('reward', tot_reward, step=episode)
tf.summary.scalar('avg loss', loss, step=episode)
break
state = new_state

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the done flag is returned from the environment signifying that the episode is complete. At the end of the episode, the training step is performed on the network by running update_network. Finally, the rewards and loss are logged in the train_writer for viewing in TensorBoard.

The training results can be observed below:

Training progress of Policy Gradient RL in Cartpole environment

As can be observed, the rewards steadily progress until they “top out” at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the user can verify that repeated runs of this version of Policy Gradient training has a high variance in its outcomes. Therefore, improvements in the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.