Proximal Policy Optimization (PPO) reinforcement learning in TensorFlow 2

In previous posts (here and here), I have been covering policy gradient-based reinforcement learning methods. In this post, I will continue the series by covering another pseudo-policy-gradient-based method called Proximal Policy Optimization (PPO). The PPO technique was designed to improve on the Trust Region Policy Optimization (TRPO) algorithm, which in turn was designed to improve on the Advantage Actor Critic (A2C) method. In this post, I’ll explain the theory surrounding the development of Proximal Policy Optimization and provide a worked coding example using PPO to train an agent to play the CartPole environment from the OpenAI Gym.

All code contained in this post can be found on this site’s Github repository here.

PPO theory

A policy gradient and A2C recap

A policy gradient-based method of reinforcement learning selects agent actions based on the output of a neural network, with each output corresponding to the probability that a certain action should be taken. During training, actions are produced by sampling from this probability distribution.

The A2C algorithm within the family of policy gradient-based reinforcement learning methods has a policy gradient term equal to:

$$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log P_{\pi_{\theta}}(a_t|s_t) A(s_t, a_t)$$

It should be noted that the above term is for gradient ascent, not descent i.e. this term should be maximized. Recall that the advantage is expressed as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

Which is a kind of “relative” or normalized benefit of taking action $a_t$ from state $s_t$, with the normalization arising from the overall value of being in that state $V(s_t)$ (for further discussion on the advantage and value estimates, see this post). This loss weights the log probability of taking the action with the advantage. The probability of taking the action is generated from the policy output of the neural network, with a softmax activation applied. So if an action is taken with high probability, let’s say 0.8, and this results in a high advantage – then this policy output will be reinforced. If the advantage is low, alternatively, then this high probability policy output will be discouraged.

Via Bellman’s equation, the advantage can actually be estimated purely from the value estimate:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This enables us to create an advantaged weighted policy gradient method by also estimating the value of any given state $V(s_t)$. This requires a two-output neural network, one for estimating action probabilities (actor part) and the other for state values (critic part), with architecture as shown below:

A2C architecture


The architecture shown above is for a network whose state inputs are game pixels. For other state types, such as the state variables of the CartPole environment, the input layers are usually densely connected. The loss function for the value output in the A2C architecture is usually simply the mean-squared error between the predicted values and the actual values calculated from the discounted future rewards.
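To make the advantage calculation concrete, here is a minimal sketch in plain Python (standalone, not part of the TensorFlow code later in this post) using hypothetical reward and critic values:

```python
GAMMA = 0.99  # reward discount factor (the same value is used in the code below)

def td_advantage(reward, value_s, value_s_next, done):
    # A(s_t, a_t) = r_{t+1} + gamma * V(s_{t+1}) - V(s_t), zeroing the
    # bootstrapped next-state value when the episode has terminated
    return reward + GAMMA * value_s_next * (1.0 - done) - value_s

# Hypothetical critic estimates for two consecutive states
adv = td_advantage(reward=1.0, value_s=10.0, value_s_next=9.5, done=0.0)
```

A positive advantage (here around 0.405) means the action did better than the critic's baseline expectation of that state, so the log probability of that action gets reinforced.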

What’s the disadvantage with the A2C method?

There are a couple of disadvantages of the A2C method. These can be summarised as follows:

  1. Large gradient steps can move the network weights into sub-optimal areas and wreck (or at least slow) the training progress
  2. There is poor sample efficiency – the batch samples can only inform the network training once

These disadvantages were addressed in the Trust Region Policy Optimization (TRPO) method, which ultimately led to the PPO method, and will be discussed in that context in the following section.

Trust Region Policy Optimization

There are some very interesting ideas in TRPO, however, in this post, I will only be covering them briefly without too much theory.

Improving sampling efficiency

The first problem to address is the poor sample efficiency of A2C (and other policy gradient methods). A set of training samples is collected by playing the game, which in reinforcement learning can be a relatively lengthy process, and those samples are used to train the network only once. This is not great from a computational perspective, especially given that reinforcement learning may require hundreds of thousands or millions of training steps to achieve good results. Ideally, in policy gradient-based methods, we would like to use samples generated via the rolling out of a trajectory under an older version of the policy to train a new policy.

But how can we do this in a mathematically valid way? It turns out we can use the concept of Importance Sampling (IS). Importance sampling allows us to compute the expected value of function $f(x)$, where $x$ is sampled from a distribution $p(x)$, by instead sampling $x$ from another distribution $q(x)$. That’s probably a bit confusing, but this is what it looks like mathematically:

$$\mathbb{E}_{x \sim p(x)} \left[f(x)\right] = \mathbb{E}_{x \sim q(x)} \left[\frac{p(x)}{q(x)} f(x) \right]$$

Where $x \sim p(x)$ in the expectation subscript refers to $x$ being sampled from $p(x)$ and the same goes for $x \sim q(x)$. We can apply this formula to the case where we might want to use actions, states and advantage values sampled during the roll-out of an old policy $\pi_{\theta old}$ to train the current policy. First recall that the policy gradient part of the loss for A2C, with the expectation still shown (before it is cashed out into an empirical sum sampled from a trajectory), looks like:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_{\theta} \log P_{\pi_{\theta}}(a_t|s_t) A(s_t, a_t)\right]$$

Now, using importance sampling, we can replace this with:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[\frac{\nabla_{\theta} P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta old}}(a_t|s_t)} A_{\theta old}(s_t, a_t)\right]$$

You may be wondering where the $\log$ went in the formula above; it turns out that the derivatives of these two expressions are equivalent for small step sizes, as discussed in the original TRPO paper.

This use of importance sampling now allows the same batch of samples from a trajectory rollout to be used many times to train the current policy, making the policy gradient method much more sample efficient.
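The importance sampling identity itself is easy to verify numerically. The following standalone NumPy sketch (not part of the PPO code later in this post) estimates $\mathbb{E}_{x \sim p(x)}[x^2]$ for a standard normal $p(x)$ – which is exactly 1 – using samples drawn only from a wider normal distribution $q(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution N(mu, sigma)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 200_000
# Sample from q = N(0, 2), never from the target distribution p = N(0, 1)
x = rng.normal(0.0, 2.0, size=n)
# Re-weight each sample by p(x) / q(x), as per the importance sampling identity
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)
is_estimate = np.mean(weights * x ** 2)  # should be close to E_p[x^2] = 1.0
```

Despite never sampling from $p(x)$, the weighted estimate converges to the correct expectation – this is exactly the trick that lets trajectories collected under an old policy train the current policy.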

Improving step stability

As mentioned above, one of the issues with the A2C method of reinforcement learning is that it is unstable. In other words, the gradient steps during training can derail the agent’s learning process. Consider the image below of the Hillary Step on Mount Everest:

Hillary Step – Mt Everest. By Debasish biswas kolkata – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=46637445

Imagine the training process of the agent is to ascend the very narrow path to the summit, which represents optimal agent behavior. However, as can be observed, the path is very narrow, and a step too large in any direction will send the agent “over the cliff” perhaps permanently damaging the training process for the agent. This is perhaps an extreme example, though maybe not – reinforcement learning optimization landscapes are infamously sharp and “treacherous”. In any case, this fact restricts both the speed of the training and the learning rates that are possible during training. If the learning rates or steps are too high, then there will be a greater risk of derailing the training process.

This is especially a problem if we are hoping to improve sample efficiency and run multiple training steps over the same set of trajectory samples. This is where the “trust region” part of TRPO comes in. Trust region optimization involves creating a small area around the current point which specifies the maximum radius of a step that can be taken in the next gradient descent/ascent iteration. Consider the two diagrams below, from this good overview of trust-region methods:

First trust region example

Second trust region example

The dark circles around the points in the diagrams above are the trust regions, which define the boundaries of the maximum next step in whatever gradient method is being used. The trust region is either reduced or expanded depending on how well the last step performed – if it resulted in a significant improvement, the trust region size is expanded, if it resulted in a degradation in performance, the trust region size is reduced. However, the trust region size can also be a simple fixed hyper-parameter, as is the case in TRPO as will be shown below.

The optimization problem solved in TRPO is approximately as follows:

$$J(\theta) = \mathbb{E} \left[\frac{P_{\pi_{\theta}}(a_t|s_t)}{P_{\pi_{\theta old}}(a_t|s_t)} A_{\theta old}(s_t, a_t)\right]$$

With an optimization constraint of:

$$D_{KL}(\theta | \theta_{old}) \leq \delta $$

Here $D_{KL}$ is the Kullback–Leibler divergence (KL divergence). The KL divergence is a way of measuring the distance between two probability distributions, so this constraint ensures that the new parameters of the policy aren’t too different from the old parameters. In other words, this constraint enforces a trust region on the parameter / policy updates.
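As a quick illustration of what this constraint measures, here is a small standalone NumPy sketch computing the discrete KL divergence between hypothetical action distributions (the probability values are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    # Discrete KL divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

old_policy = [0.5, 0.3, 0.2]    # action probabilities under the old parameters
small_step = [0.48, 0.32, 0.2]  # a nearby updated policy
large_step = [0.1, 0.1, 0.8]    # a drastically different updated policy
```

The trust region constraint keeps each update in the small_step regime, where the divergence is tiny, and rules out jumps like large_step.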

In the original paper, the conjugate gradient optimization method is used to solve this optimization problem with the aforementioned $D_{KL}(\theta | \theta_{old})$ based constraint. However, compared to the standard optimization methods commonly used in deep learning, conjugate gradient optimization is slow. Therefore, an improvement on TRPO that removes the need for strict constraints and conjugate gradient optimization, while keeping both the trust region concept and the improvement in sample efficiency, would clearly be desirable. Proximal Policy Optimization (PPO) is one way of achieving these goals.

Proximal Policy Optimization (PPO)

As shown above, the importance sampling version of the policy loss function, which enables better sample efficiency, is expressed as follows in the original PPO paper:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

Here the superscript CPI stands for “conservative policy iteration”, and the $\pi_\theta (a_t | s_t)$ notation expresses the same probability as shown previously (i.e. $P_{\pi_{\theta}}(a_t|s_t)$). This loss function, while allowing greater sample efficiency by using importance sampling, as discussed above, will still be subject to large gradient updates derailing training progress. In order to introduce a quasi-trust region limitation, while not introducing conjugate gradient-based optimization or strict constraints, the authors of the PPO paper introduce the following alternative loss function:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$$

Where $\epsilon$ is a hyperparameter recommended to be between 0.1 and 0.3. This CLIP loss function is essentially the same as the CPI loss function, except when the probability of an action under the new parameters ($\pi_\theta (a_t | s_t)$) exceeds that under the old parameters by a certain degree (i.e. $r > 1 + \epsilon$), at which point the loss value is clipped. This ensures that no large steps can occur in the gradient updates, which essentially enforces a trust region. The plots below for both positive and negative advantage values illustrate the behavior of $L^{CLIP}$ with respect to $r$:

PPO clipped loss function (from https://arxiv.org/pdf/1707.06347.pdf)
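The clipping behaviour can also be sketched numerically. Below is a minimal standalone NumPy version of the per-sample $L^{CLIP}$ term (the $\epsilon$ value is just an example from the recommended range):

```python
import numpy as np

EPSILON = 0.2  # example value of the clip hyperparameter epsilon

def l_clip(ratio, advantage):
    # Per-sample surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = np.clip(ratio, 1.0 - EPSILON, 1.0 + EPSILON)
    return float(np.minimum(ratio * advantage, clipped * advantage))
```

With a positive advantage, the objective flat-lines once $r$ exceeds $1 + \epsilon$, so there is no incentive (and no gradient) to push the probability ratio any further – the trust-region-like behavior shown in the plots above.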

The total loss function for the PPO method consists of $L^{CLIP}$, the mean-square error loss of the value estimator, and a term to encourage higher entropy / greater exploration (with the latter two terms identical to the components in A2C). It can be expressed as follows, with signs introduced to convert a maximization function to a minimization function:

$$L_{TOTAL} = -L^{CLIP} + k_1 L^{VALUE} - k_2 L^{ENTROPY}$$

Where the $k$ values are hyperparameters that weigh the various components of the loss against $L^{CLIP}$ (with $k_1$ usually on the order of 0.5, and $k_2$ on the order of 0.01). In the following section, I will present a code work-through of PPO being utilized to train a CartPole agent.
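As a quick sanity check on the signs in this combination, here is a tiny standalone sketch (the weight values are just the representative magnitudes mentioned above, not prescribed by the PPO paper):

```python
CRITIC_LOSS_WEIGHT = 0.5    # k_1, weighting the value loss
ENTROPY_LOSS_WEIGHT = 0.01  # k_2, weighting the entropy term

def total_loss(l_clip, l_value, l_entropy):
    # Signs flipped so that minimizing the total maximizes L^CLIP and the entropy,
    # while minimizing the value estimation error
    return -l_clip + CRITIC_LOSS_WEIGHT * l_value - ENTROPY_LOSS_WEIGHT * l_entropy
```

A larger $L^{CLIP}$ or a higher entropy lowers the total loss, while a larger value error raises it, which is exactly the behaviour a minimizer needs.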

PPO code walk-through

As stated above, all code in this walk-through can be found on this site’s Github repository. The RL environment that will be used is OpenAI’s CartPole environment. In this example, I’ll be making use of TensorFlow’s GradientTape functionality and the Keras model API.

The code below shows the establishment of some constants and the creation of the neural network model structure:

import tensorflow as tf
from tensorflow import keras
import tensorflow_probability as tfp
import numpy as np
import gym
import datetime as dt

STORE_PATH = 'C:\\Users\\andre\\TensorBoard\\PPOCartpole'
# representative hyperparameter values; tune as needed
CRITIC_LOSS_WEIGHT = 0.5     # k_1 weighting of the value loss
ENTROPY_LOSS_WEIGHT = 0.01   # k_2 weighting of the entropy loss
ENT_DISCOUNT_RATE = 0.995    # entropy weight decay applied each training round
BATCH_SIZE = 64              # trajectory roll-out length per training round
GAMMA = 0.99
CLIP_VALUE = 0.2             # epsilon in the L^CLIP loss (0.1-0.3 recommended)
LR = 0.001
NUM_TRAIN_EPOCHS = 10        # training passes over each batch of samples


env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

ent_discount_val = ENTROPY_LOSS_WEIGHT

class Model(keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        self.num_actions = num_actions
        self.dense1 = keras.layers.Dense(64, activation='relu')
        self.dense2 = keras.layers.Dense(64, activation='relu')
        self.value = keras.layers.Dense(1)
        self.policy_logits = keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.value(x), self.policy_logits(x)

    def action_value(self, state):
        value, logits = self.predict_on_batch(state)
        dist = tfp.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, value

Most of the constants declared in the first part of the code above are familiar from the A2C tutorial – the critic and entropy loss weight values, the gamma value (reward discounting factor), and so on. However, a few new values need to be discussed. First, I have included an entropy discounting rate (ENT_DISCOUNT_RATE) – after each round of training, the entropy loss weight value is discounted by this value to discourage exploration as the number of episodes played piles up. The constant CLIP_VALUE refers to the value $\epsilon$ in the $L^{CLIP}$ loss function ($\mathbb{E}_t\left[\min\left(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$). The value NUM_TRAIN_EPOCHS refers to the number of times a given trajectory of rewards, actions, states, etc. will be used to train the network, making use of the importance sampling configuration of the loss function.

The Model class is a fairly standard Keras model API implementation, with two common dense layers and a separate value/critic output and a policy logits output. For more details on this model implementation, see the A2C tutorial.

The loss functions

Next, we’ll review the custom loss functions:

def critic_loss(discounted_rewards, value_est):
    return tf.cast(tf.reduce_mean(keras.losses.mean_squared_error(discounted_rewards, value_est)) * CRITIC_LOSS_WEIGHT,
                   tf.float32)

First the critic loss – this loss function calculates the mean squared error between the discounted rewards (calculated by discounting the trajectory of rewards collected while running the agent in the environment) and the values (value_est) predicted by the neural network from the trajectory of states.

Next, we look at the entropy loss:

def entropy_loss(policy_logits, ent_discount_val):
    probs = tf.nn.softmax(policy_logits)
    entropy_loss = -tf.reduce_mean(keras.losses.categorical_crossentropy(probs, probs))
    return entropy_loss * ent_discount_val

In this loss function, the policy logits directly from the neural network (and based on the trajectory of states sampled from the environment) and the discounted entropy weight (ent_discount_val) are passed as arguments. The policy logits are then converted to quasi-probabilities by applying the softmax function, and then the entropy calculation is performed by applying the Keras categorical cross-entropy function on probs twice (again, for an explanation of how this works, see here). Notice the minus sign to invert the entropy calculation – in a minimization optimization calculation, this will act to enhance the entropy / exploration of the agent. The discounted entropy weight value is then applied to the entropy and the product is returned.
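The fact that the cross-entropy of a distribution with itself is simply the Shannon entropy is easy to confirm with a small standalone NumPy sketch:

```python
import numpy as np

def entropy(probs):
    # Shannon entropy -sum(p * log p) - the quantity that applying
    # categorical_crossentropy(probs, probs) computes per sample
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs)))
```

A uniform policy has the highest entropy, so rewarding entropy in the total loss pushes the agent toward exploration rather than premature commitment to one action.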

def actor_loss(advantages, old_probs, action_inds, policy_logits):
    probs = tf.nn.softmax(policy_logits)
    new_probs = tf.gather_nd(probs, action_inds)

    ratio = new_probs / old_probs

    policy_loss = -tf.reduce_mean(tf.math.minimum(
        ratio * advantages,
        tf.clip_by_value(ratio, 1.0 - CLIP_VALUE, 1.0 + CLIP_VALUE) * advantages
    ))
    return policy_loss

The actor loss takes in the advantages (calculated in another function, which will be explained shortly), the old probability values (again, calculated elsewhere), the action indices, and the policy logits from the neural network model. The action indices are the array indices of the actions actually taken during the trajectory roll-out. In other words, if, for a certain state in the game, the action probabilities are [0.01, 0.3, 0.5, 0.2] and the 0.5 action is taken, the action index, in this case, is 2 (with a zero-based index system). The action_inds argument is a tensor of indices that correspond to the action taken for every record in the batch.

The first line of the function is where the policy logit values are converted to quasi-probabilities using the softmax function. These probabilities are generated using the most recent neural network parameters – in other words, after the softmax has been applied, they are the $\pi_\theta (a_t | s_t)$ values in the $L^{CLIP}$ loss function. The old_probs values refer to the $\pi_{\theta old}(a_t | s_t)$ values. We are only after the action probabilities that correspond to the actions actually taken during the agent’s playing of the game. The tf.gather_nd function, therefore, takes probs and extracts those probability outputs which correspond to the actions that were actually taken, specified by action_inds. This action_inds based selection has already been performed on the probabilities from the neural network parameterized by the old network weight values $\theta_{old}$, as will be shown later.
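The gather operation can be illustrated with a small standalone NumPy equivalent (the probabilities and actions below are made up for illustration):

```python
import numpy as np

# Hypothetical batch of softmax outputs (4 steps, 2 actions) and the actions taken
probs = np.array([[0.7, 0.3],
                  [0.4, 0.6],
                  [0.9, 0.1],
                  [0.2, 0.8]])
actions = np.array([0, 1, 0, 1])

# NumPy equivalent of tf.gather_nd(probs, action_inds): pick out the probability
# of the action actually taken at each step of the trajectory
taken_probs = probs[np.arange(len(actions)), actions]
```

Only these per-step probabilities of the taken actions enter the ratio $r_t(\theta)$; the probabilities of the actions not taken play no direct role in the actor loss.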

Then the r value (i.e. the $\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}$ value) is computed. After this, the clipping function is applied as per the $L^{CLIP}$ calculation, shown again below for reference:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t\right)\right]$$

Notice the negative sign in the calculation of policy_loss – this ensures that $L^{CLIP}$ is maximized during the minimization optimization that will be performed in the TensorFlow environment.

Next, I introduce the train_model function, which calls the various loss functions and makes the gradient step:

def train_model(action_inds, old_probs, states, advantages, discounted_rewards, optimizer, ent_discount_val):
    with tf.GradientTape() as tape:
        values, policy_logits = model.call(tf.stack(states))
        act_loss = actor_loss(advantages, old_probs, action_inds, policy_logits)
        ent_loss = entropy_loss(policy_logits, ent_discount_val)
        c_loss = critic_loss(discounted_rewards, values)
        tot_loss = act_loss + ent_loss + c_loss
    grads = tape.gradient(tot_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return tot_loss, c_loss, act_loss, ent_loss

This function takes advantage of the tf.GradientTape() context manager – for more details, see this post. Within this context, the model is called on the list of states collected during the unrolling of a trajectory of the agent through the game. The states variable will be a list of input states, of length equal to BATCH_SIZE. The model returns value and policy logit estimates. These values are then passed through to the various loss functions and summed. The gradients of the trainable variables with respect to this total loss value are then calculated, and an optimization step is undertaken using the apply_gradients method.

The next function calculates the discounted rewards and the advantages:

def get_advantages(rewards, dones, values, next_value):
    discounted_rewards = np.array(rewards + [next_value[0]])

    for t in reversed(range(len(rewards))):
        discounted_rewards[t] = rewards[t] + GAMMA * discounted_rewards[t+1] * (1-dones[t])
    discounted_rewards = discounted_rewards[:-1]
    # advantages are bootstrapped discounted rewards - values, using Bellman's equation
    advantages = discounted_rewards - np.stack(values)[:, 0]
    # standardise advantages
    advantages -= np.mean(advantages)
    advantages /= (np.std(advantages) + 1e-10)
    # standardise rewards too
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= (np.std(discounted_rewards) + 1e-8)
    return discounted_rewards, advantages

This function has been discussed in previous tutorials, for instance, this post. In this case, both the advantages (used in the actor loss function) and the discounted rewards (used in the critic loss function) are normalized to improve training stability.
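The backward discounting pass at the heart of this function can be traced by hand with a tiny standalone example (toy rewards and a hypothetical bootstrap value):

```python
import numpy as np

GAMMA = 0.99

# Toy trajectory: three unit rewards, with the episode ending on the final step
rewards = [1.0, 1.0, 1.0]
dones = [0.0, 0.0, 1.0]
next_value = 5.0  # hypothetical bootstrap value; masked out here because done=1

# Same backward recursion as in get_advantages, before normalization
discounted = np.array(rewards + [next_value])
for t in reversed(range(len(rewards))):
    discounted[t] = rewards[t] + GAMMA * discounted[t + 1] * (1 - dones[t])
discounted = discounted[:-1]
```

Working backwards: the terminal step keeps just its reward (1.0), the middle step becomes 1 + 0.99 × 1 = 1.99, and the first step 1 + 0.99 × 1.99 = 2.9701. Note how the done flag stops the hypothetical bootstrap value from leaking across the episode boundary.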

model = Model(num_actions)
optimizer = keras.optimizers.Adam(learning_rate=LR)

train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PPO-CartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")

num_steps = 10000000
episode_reward_sum = 0
state = env.reset()
episode = 1
total_loss = None
for step in range(num_steps):
    rewards = []
    actions = []
    values = []
    states = []
    dones = []
    probs = []
    for _ in range(BATCH_SIZE):
        _, policy_logits = model(state.reshape(1, -1))

        action, value = model.action_value(state.reshape(1, -1))
        new_state, reward, done, _ = env.step(action.numpy()[0])

        # store this step of the trajectory for training
        actions.append(action)
        values.append(value[0])
        states.append(state)
        dones.append(done)
        probs.append(policy_logits)
        rewards.append(reward)

        episode_reward_sum += reward

        state = new_state

        if done:
            state = env.reset()
            if total_loss is not None:
                print(f"Episode: {episode}, latest episode reward: {episode_reward_sum}, "
                      f"total loss: {np.mean(total_loss)}, critic loss: {np.mean(c_loss)}, "
                      f"actor loss: {np.mean(act_loss)}, entropy loss {np.mean(ent_loss)}")
            with train_writer.as_default():
                tf.summary.scalar('rewards', episode_reward_sum, episode)
            episode_reward_sum = 0

            episode += 1

    _, next_value = model.action_value(state.reshape(1, -1))
    discounted_rewards, advantages = get_advantages(rewards, dones, values, next_value[0])

    actions = tf.squeeze(tf.stack(actions))
    probs = tf.nn.softmax(tf.squeeze(tf.stack(probs)))
    action_inds = tf.stack([tf.range(0, actions.shape[0]), tf.cast(actions, tf.int32)], axis=1)

    total_loss = np.zeros((NUM_TRAIN_EPOCHS))
    act_loss = np.zeros((NUM_TRAIN_EPOCHS))
    c_loss = np.zeros((NUM_TRAIN_EPOCHS))
    ent_loss = np.zeros((NUM_TRAIN_EPOCHS))
    for epoch in range(NUM_TRAIN_EPOCHS):
        loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds),
                                 states, advantages, discounted_rewards, optimizer,
                                 ent_discount_val)
        total_loss[epoch] = loss_tuple[0]
        c_loss[epoch] = loss_tuple[1]
        act_loss[epoch] = loss_tuple[2]
        ent_loss[epoch] = loss_tuple[3]
    ent_discount_val *= ENT_DISCOUNT_RATE

    with train_writer.as_default():
        tf.summary.scalar('tot_loss', np.mean(total_loss), step)
        tf.summary.scalar('critic_loss', np.mean(c_loss), step)
        tf.summary.scalar('actor_loss', np.mean(act_loss), step)
        tf.summary.scalar('entropy_loss', np.mean(ent_loss), step)

The majority of this trajectory roll-out / training loop has been covered in this post. The main difference is the loop over NUM_TRAIN_EPOCHS:

for epoch in range(NUM_TRAIN_EPOCHS):
    loss_tuple = train_model(action_inds, tf.gather_nd(probs, action_inds),
                             states, advantages, discounted_rewards, optimizer,
                             ent_discount_val)
    total_loss[epoch] = loss_tuple[0]
    c_loss[epoch] = loss_tuple[1]
    act_loss[epoch] = loss_tuple[2]
    ent_loss[epoch] = loss_tuple[3]

In this section of the code, the train_model function is run NUM_TRAIN_EPOCHS times. This can be performed in PPO using the importance sampling functionality i.e.:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[\frac{\pi_\theta (a_t | s_t)}{\pi_{\theta old}(a_t | s_t)}A_t\right] = \mathbb{E}_t\left[r_t(\theta)A_t\right]$$

In this case, probs are the probabilities extracted from the model over the rolling out of the trajectory by the agent. In other words, they are the probabilities generated from the neural network with the parameters of the network based on the previous round of training. Therefore, these are the old probabilities or $\pi_{\theta old}(a_t | s_t)$ in the formula above. The new probabilities $\pi_{\theta}(a_t | s_t)$ are generated within the gradient tape context of train_model – a new set of parameters for each training loop. However, through each of these iterations, the old probabilities, or probs, stay constant.

Running this code on the CartPole environment yields the following rewards during training:

PPO Cartpole rewards

As can be observed, the agent manages to score the maximum CartPole reward of 200 quite early in the training process; however, it takes some time to achieve this result consistently. Further fine-tuning of the entropy reduction factor, or other hyperparameters, would likely yield a consistent high score sooner.

This concludes my introductory post on the Proximal Policy Optimization (PPO) method and its implementation in TensorFlow 2. I hope it was informative for you.