In a series of recent posts, I have been reviewing the various Q based methods of deep reinforcement learning (see here, here, here, here and so on). Deep Q based reinforcement learning operates by training a neural network to learn the Q value for each action *a* of an agent which resides in a certain state *s* of the environment. The *policy* which guides the actions of the agent in this paradigm operates by a random selection of actions at the beginning of training (the epsilon-greedy method), after which the agent increasingly selects the action with the highest predicted Q value in each state *s*. The Q value is simply an estimation of the future rewards which will result from taking action *a*. An alternative to deep Q based reinforcement learning is to forget about the Q value altogether and instead have the neural network estimate the optimal policy *directly*. Reinforcement learning methods based on this idea are often called *Policy Gradient* methods.

This post will review the REINFORCE, or Monte Carlo, version of the Policy Gradient methodology, and apply it to the OpenAI Gym CartPole environment. All code used and explained in this post can be found on this site's Github repository.

### Eager to build deep learning systems in TensorFlow 2? Get the book **here**

## Policy Gradients and their theoretical foundation

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1}\gamma^t r_t\right]$$

Here, $\theta$ refers to the parameters (i.e. the weights) of the neural network which defines the policy $\pi_\theta$, $\gamma$ is the discount factor and $r_t$ is the reward received at step $t$ of the episode. The expectation operator $\mathbb{E}$ is important: $J(\theta)$ is the *expected* sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.

## Finding the Policy Gradient

First, let's make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in *x* multiplied by the probability of *x*, like so:

$$\mathbb{E}[f(x)] = \sum_x P(x) f(x)$$
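As a concrete illustration of this definition (a toy example, not part of the reinforcement learning problem itself), the expected value of $f(x) = x^2$ for a fair six-sided die can be computed directly:

```python
# E[f(x)] = sum over x of P(x) * f(x), with P(x) = 1/6 for a fair die
values = [1, 2, 3, 4, 5, 6]
expectation = sum((1 / 6) * x ** 2 for x in values)
print(expectation)  # 91/6, approximately 15.17
```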

In our case, $f(x)$ corresponds to the sum of rewards $R(\tau)$, where $\tau$ is the *trajectory* of the agent "moving" through the environment. It can be defined as:

$$\tau = (s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T)$$

In other words, the trajectory is the sequence of states and actions experienced by the agent from the start of the episode up to its final step, *T*. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(\tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(\tau)$ – but what does this actually look like in a reinforcement learning environment?

The probability of a given trajectory occurring is the product of the probability of each action being selected and each state transition occurring along the way:

$$P(\tau) = \prod_{t=0}^{T-1} P_{\pi_\theta}(a_t|s_t) P(s_{t+1}|s_t, a_t)$$

The first term, $P_{\pi_\theta}(a_t|s_t)$, is supplied by the neural network: it is the probability of the agent selecting action $a_t$ *given* the agent is in state $s_t$. The second term, $P(s_{t+1}|s_t, a_t)$, is supplied by the environment: it is the probability of transitioning to state $s_{t+1}$ given the current state and action. These probabilities are multiplied out over all the steps of the episode of length *T* to produce the trajectory $\tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(\tau)$ and $R(\tau)$ into the original expectation and take the derivative to get to $\nabla J(\theta)$ which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two.
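Before moving on, the trajectory probability itself can be made concrete with a minimal sketch. The probabilities below are made up purely for illustration:

```python
# Hypothetical two-step trajectory. At each step, multiply the policy's
# action probability P(a_t|s_t) by the environment's transition
# probability P(s_{t+1}|s_t, a_t).
policy_probs = [0.8, 0.6]      # P(a_0|s_0), P(a_1|s_1) - made-up values
transition_probs = [0.9, 1.0]  # P(s_1|s_0, a_0), P(s_2|s_1, a_1)

p_tau = 1.0
for p_action, p_transition in zip(policy_probs, transition_probs):
    p_tau *= p_action * p_transition
print(p_tau)  # 0.8 * 0.9 * 0.6 * 1.0 = 0.432
```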

## Calculating the Policy Gradient

First, we take the gradient of the expectation and apply the so-called log-derivative trick, $\nabla_\theta P(\tau) = P(\tau)\nabla_\theta \log P(\tau)$, which gives:

$$\nabla J(\theta) = \mathbb{E}\left[R(\tau)\nabla_\theta \log P(\tau)\right]$$

Taking the log of $P(\tau)$ converts the product over the trajectory into a sum, and because the environment transition probabilities $P(s_{t+1}|s_t, a_t)$ do not depend on $\theta$, their gradients are zero and they drop out, leaving:

$$\nabla J(\theta) = \mathbb{E}\left[R(\tau)\sum_{t=0}^{T-1}\nabla_\theta \log P_{\pi_\theta}(a_t|s_t)\right]$$

Notice that $\log P_{\pi_\theta}(a_t|s_t)$ is simply the *log* of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial state $s_0$ as input, and it would produce a vector of actions $a_0$ with pseudo-probabilities generated by the softmax operation in the final layer.
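As a minimal sketch of that last point (with arbitrary final-layer values, not outputs from any real network), the softmax and its log look like this:

```python
import math

# Toy final-layer outputs for two actions (arbitrary values):
logits = [1.0, 2.0]
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]      # pseudo-probabilities, sum to 1
log_probs = [math.log(p) for p in softmax]   # log P(a_t|s_t) for each action
```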

Finally, skipping over a step or two for the sake of brevity, $R(\tau)$ can be replaced by the sum of discounted rewards which follow each action, giving the gradient expression we will actually use:

$$\nabla J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1}\nabla_\theta \log P_{\pi_\theta}(a_t|s_t)\sum_{t'=t+1}^{T}\gamma^{t'-t-1} r_{t'}\right]$$

For each step, the inner summation runs from $t' = t + 1$ to *T* – in other words, for the first action, from $t' = 1$ to the *total length of the episode*. Let's say the episode length was 4 states long – this term would then look like $\gamma^0 r_1 + \gamma^1 r_2 + \gamma^2 r_3$, where $\gamma$ is the discounting factor and is < 1.
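That example term can be checked numerically with made-up reward values:

```python
gamma = 0.95
r1, r2, r3 = 1.0, 2.0, 3.0  # hypothetical rewards for a 4-state episode
reward_to_go = gamma ** 0 * r1 + gamma ** 1 * r2 + gamma ** 2 * r3
print(reward_to_go)  # 1.0 + 1.9 + 2.7075 = 5.6075
```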

Notice that the weighting applied to each log-probability term depends on the rewards received in *all subsequent states* in the episode. Therefore, in order to execute this method of learning, we can only take gradient learning steps after the *full episode* has been played to completion. Only after the episode is complete can we perform the training step.

Conveniently, the inner part of this expression – a sum of log probabilities multiplied by a weighting term – has the same form as the cross entropy loss, $-\sum_x p(x) \log q(x)$, which is a summation over the variable *x*. If we look at the source code of the Keras implementation of cross-entropy, we can see the following:
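The exact backend code varies between TensorFlow versions, but the core of the calculation is a negative reduce-sum of the target multiplied by the log of the output, which can be sketched in plain Python as follows (the clipping of output values for numerical stability is omitted, and the values are illustrative only):

```python
import math

def categorical_crossentropy(target, output):
    # Core of the Keras calculation: -reduce_sum(target * log(output))
    # over the class (action) axis; numerical-stability clipping omitted.
    return -sum(t * math.log(o) for t, o in zip(target, output))

# With a one-hot target, this reduces to -log of the selected probability:
loss = categorical_crossentropy([0.0, 1.0], [0.3, 0.7])
print(loss)  # -log(0.7), approximately 0.357
```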

The *output* tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of *output* is calculated in the above. The target value, for our purposes, can be all the discounted rewards calculated at each step in the trajectory, and will be of size (num_steps_in_episode, 1). The summation of the multiplication of these terms is then calculated (*reduce_sum*). Gradient based training in TensorFlow 2 is generally a minimisation of the loss function; however, we want to maximise the calculation as discussed above. The good thing is, the sign of the cross entropy calculation shown above is inverted – so minimising this loss maximises our objective, and we are good to go.

To call this training step utilising Keras, all we have to do is execute something like the following:

```python
network.train_on_batch(states, discounted_rewards)
```

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through *network*, apply the softmax function, and this will become the *output* variable in the Keras source code snippet above. Likewise, *discounted_rewards* is the same as *target* in the source code snippet above.

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

## Policy Gradient reinforcement learning in TensorFlow 2 and Keras

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{\pi_{\theta}}(a_t|s_t)$ with the state as the input:

```python
GAMMA = 0.95
env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n
network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam())
```

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities to approximate $P_{\pi_{\theta}}(a_t|s_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser.

The next part of the code chooses the action from the output of the model:

```python
def get_action(network, state, num_actions):
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
    return selected_action
```

As can be seen, first the softmax output is extracted from the network by inputting the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values.
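The weighted random choice itself is easy to illustrate in isolation; here is a plain-Python sketch (with made-up softmax values) of what *np.random.choice* with the *p* argument does:

```python
import random

# Made-up softmax output for two actions:
softmax_out = [0.25, 0.75]

# Equivalent of np.random.choice(num_actions, p=softmax_out): action 1
# will be selected roughly 75% of the time.
action = random.choices(range(len(softmax_out)), weights=softmax_out)[0]
```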

The next function is the main function involved in executing the training step:

```python
def update_network(network, rewards, states, actions, num_actions):
    reward_sum = 0
    discounted_rewards = []
    for reward in rewards[::-1]:  # reverse buffer r
        reward_sum = reward + GAMMA * reward_sum
        discounted_rewards.append(reward_sum)
    discounted_rewards.reverse()
    discounted_rewards = np.array(discounted_rewards)
    # standardise the rewards
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    states = np.vstack(states)
    loss = network.train_on_batch(states, discounted_rewards)
    return loss
```

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from $t' = t + 1$ to $T$ according to $\sum_{t'=t+1}^{T} \gamma^{t'-t-1} r_{t'}$. The input argument *rewards* is a list of all the rewards achieved at each step in the episode. The *rewards[::-1]* operation reverses the order of the rewards list, so the first run through the *for* loop will deal with the last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the *for* loop is executed. Let's say that the episode length is equal to 4 – $r_3$ will then refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + \gamma r_3$, $r_1 + \gamma r_2 + \gamma^2 r_3$, $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3$]

This list is in reverse order relative to the actual state list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the *for* loop reverses the list (*discounted_rewards.reverse()*).

Next, the list is converted into a numpy array, and the rewards are standardised to zero mean and unit standard deviation to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras *train_on_batch* function, which was detailed earlier.
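As a quick sanity check of these two steps, the same discount-and-standardise logic can be replicated in plain Python with four hypothetical unit rewards:

```python
GAMMA = 0.95  # same discount factor as defined earlier

rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical rewards r_0 to r_3

# Discount: reverse-accumulate, then restore the original order.
reward_sum = 0.0
discounted = []
for r in reversed(rewards):
    reward_sum = r + GAMMA * reward_sum
    discounted.append(reward_sum)
discounted.reverse()
print(discounted)  # approximately [3.709875, 2.8525, 1.95, 1.0]

# Standardise to zero mean and unit standard deviation.
mean = sum(discounted) / len(discounted)
std = (sum((d - mean) ** 2 for d in discounted) / len(discounted)) ** 0.5
standardised = [(d - mean) / std for d in discounted]
```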

The next part of the code is the main episode and training loop:

```python
num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    states = []
    actions = []
    while True:
        action = get_action(network, state, num_actions)
        new_state, reward, done, _ = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        if done:
            loss = update_network(network, rewards, states, actions, num_actions)
            tot_reward = sum(rewards)
            print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', tot_reward, step=episode)
                tf.summary.scalar('avg loss', loss, step=episode)
            break
        state = new_state
```

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the *done* flag is returned from the environment, signifying that the episode is complete. At the end of the episode, the training step is performed on the *network* by running *update_network*. Finally, the rewards and loss are logged in the *train_writer* for viewing in TensorBoard.

The training results can be observed below:

As can be observed, the rewards steadily progress until they "top out" at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the reader can verify that repeated runs of this version of Policy Gradient training have a high variance in their outcomes. Therefore, improvements to the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.
