In this post, we’ll be covering Dueling Q networks for reinforcement learning in TensorFlow 2. This reinforcement learning architecture is an improvement on the Double Q architecture, which has been covered here. In this tutorial, I’ll introduce the Dueling Q network architecture, its advantages and how to build one in TensorFlow 2. We’ll be running the code on the OpenAI Gym CartPole environment so that readers can train the network quickly and easily. In future posts, I’ll be showing results on Atari environments, which are more complicated. For an introduction to reinforcement learning, check out this post and this post. All the code for this tutorial can be found on this site’s GitHub repo.


## A recap of Double Q learning

As discussed in detail in this post, vanilla deep Q learning has some problems. These problems can be boiled down to two main issues:

1. The bias problem: vanilla deep Q networks tend to overestimate rewards in noisy environments, leading to non-optimal training outcomes
2. The moving target problem: because the same network is responsible for both the choosing of actions and the evaluation of actions, this leads to training instability

With regards to (1), say we have a state with two possible actions, each giving noisy rewards. Action *a* returns a random reward drawn from a normal distribution with a mean of 2 and a standard deviation of 1, *N(2, 1)*. Action *b* returns a random reward drawn from a normal distribution of *N(1, 4)*. On average, action *a* is the optimal action to take in this state. However, because of the *argmax* function in deep Q learning, action *b* will tend to be favoured: its higher standard deviation produces occasional large rewards, and taking the maximum over noisy estimates preferentially picks up those lucky samples.
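The effect is easy to reproduce numerically. The following is a minimal sketch in plain Python, assuming that the second parameter of *N(1, 4)* is a standard deviation (as it is for *N(2, 1)*) and that the agent sees a single noisy sample of each action’s reward. The function name and the number of trials are illustrative only.

```python
import random


def simulate_argmax_bias(num_trials=20000, seed=0):
    """Draw one noisy reward per action and record what a max / argmax would see.

    Assumption: action a ~ N(mean=2, std=1), action b ~ N(mean=1, std=4).
    """
    rng = random.Random(seed)
    b_chosen = 0
    max_total = 0.0
    for _ in range(num_trials):
        reward_a = rng.gauss(2.0, 1.0)
        reward_b = rng.gauss(1.0, 4.0)
        if reward_b > reward_a:
            b_chosen += 1
        max_total += max(reward_a, reward_b)
    # fraction of trials in which the worse action looks best, and the mean of the max
    return b_chosen / num_trials, max_total / num_trials


frac_b, mean_max = simulate_argmax_bias()
print(f"b looks best in {frac_b:.1%} of trials; mean of max = {mean_max:.2f} "
      f"(the true best mean is 2.0)")
```

The average of the maximum sits well above 2, the mean reward of the genuinely better action, which is the overestimation described above.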

For (2), let’s consider another state, state 1, with three possible actions *a*, *b* and *c*. Let’s say we know that *b* is the optimal action. However, when we first initialize the neural network, in state 1, action *a* tends to be chosen. When we’re training our network, the loss function will drive the weights of the network towards choosing action *b*. However, next time we are in state 1, the parameters of the network have changed to such a degree that now action *c* is chosen. Ideally, we would have liked the network to consistently choose action *a* in state 1 until it was gradually trained to choose action *b*. But now the goal posts have shifted, and we are trying to move the network from *c* to *b* instead of *a* to *b*. This gives rise to instability in training. This is the problem that arises when you have the same network both choosing actions and evaluating the worth of actions.

To overcome these problems, Double Q learning proposed the following way of determining the target Q value:

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \theta^-_t)$$

Here $\theta_t$ refers to the primary network parameters (weights) at time *t*, and $\theta^-_t$ refers to something called the *target network* parameters at time *t*. This *target network* is a kind of delayed copy of the primary network. As can be observed, the optimal action in state $s_{t+1}$ is chosen from the primary network ($\theta_t$) but the evaluation or estimate of the Q value of this action is determined from the *target network* ($\theta^-_t$).

This can be shown more clearly by the equations below:

$$a^* = \arg\max_a Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

By doing this two things occur. First, different networks are used to choose the actions and evaluate the actions. This breaks the moving target problem mentioned earlier. Second, the primary network and the target network have essentially been trained on different samples from the memory bank of states and actions (the target network is “trained” on older samples than the primary network). Because of this, any bias due to environmental randomness should be “smoothed out”.

As was shown in my previous post on Double Q learning, there is a significant improvement in using Double Q learning instead of vanilla deep Q learning. However, a further improvement can be made on the Double Q idea: the Dueling Q architecture, which will be covered in the next section.
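As a concrete illustration of the two-step target calculation, here is a minimal sketch for a single transition. Plain Python lists stand in for the outputs of the primary and target networks, and the function name, argument names and numbers are hypothetical.

```python
def double_q_target(reward, gamma, q_next_primary, q_next_target, terminal=False):
    """Double Q target for one transition.

    q_next_primary / q_next_target: Q(s_{t+1}, a) for every action, as estimated
    by the primary and target networks respectively (hypothetical inputs).
    """
    if terminal:
        # no next state, so the target is just the immediate reward
        return reward
    # the primary network chooses the action ...
    best_action = max(range(len(q_next_primary)), key=lambda a: q_next_primary[a])
    # ... and the target network evaluates it
    return reward + gamma * q_next_target[best_action]


target = double_q_target(reward=1.0, gamma=0.95,
                         q_next_primary=[1.0, 3.0], q_next_target=[2.0, 0.5])
print(target)
```

Note that the target network’s own maximum (2.0, for action 0) is never used here: the primary network selects action 1, so the target network’s estimate for action 1 enters the target.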

## Dueling Q introduction

The Dueling Q architecture trades on the idea that the evaluation of the Q function implicitly calculates two quantities:

- V(s): the *value* of being in state s
- A(s, a): the *advantage* of taking action *a* in state s

These values, along with the Q function, Q(s, a), are very important to understand, so we will do a deep dive of these concepts here. Let’s first examine the generalised formula for the value function V(s):

$$V^{\pi}(s) = \mathbb{E} \left[ \sum_{i=1}^T \gamma^{i - 1}r_{i}\right]$$

The formula above means that the value function at state s, operating under a policy $\pi$, is the expected sum of future discounted rewards starting from state s. In other words, if an agent starts at *s*, it is the discounted sum of all the rewards the agent can expect to collect while operating under the given policy $\pi$. The $\mathbb{E}$ is the expectation operator.
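The quantity inside the expectation can be sketched for a single sampled episode as follows. This is a minimal plain-Python illustration with a hypothetical reward sequence; the value function itself would be estimated by averaging this return over many episodes generated under $\pi$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(i-1) * r_i for i = 1..T, for one sampled episode."""
    # enumerate starts at 0, which matches the exponent i - 1
    return sum(gamma ** i * r for i, r in enumerate(rewards))


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25
```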

Let’s consider a basic example. Let’s assume an agent is playing a game with a set number of turns. In the second-to-last turn, the agent is in state *s*. From this state, it has 3 possible actions, with rewards of 10, 50 and 100 respectively. Let’s say that the policy for this agent is a simple random selection. Because this is the last set of actions and rewards in the game, due to the game finishing next turn, there are no discounted future rewards. The value for this state under the random action policy is:

$$V^{\pi}(s) = \mathbb{E} \left[\text{random}\left(10, 50, 100\right)\right] = 53.333$$

Now, clearly this policy is not going to produce optimum outcomes. However, we know that for the optimum policy, the value of this state would be:

$$V^*(s) = \max (10, 50, 100) = 100$$

If you recall, from Q learning theory, the optimal action in this state is:

$$a^* = \arg\max_a Q(s, a)$$

and the optimal Q value from this action in this state would be:

$$Q(s, a^*) = \max (10, 50, 100) = 100$$

Therefore, under the optimal (deterministic) policy we have:

$$Q(s,a^*) = V(s)$$

However, what if we aren’t operating under the optimal policy (yet)? Let’s return to the case where our policy is simple random action selection. In such a case, the Q function at state s could be described as (remember there are no future discounted rewards, and V(s) = 53.333):

$$Q(s, a) = V(s) + (-43.33, -3.33, 46.67) = (10, 50, 100)$$

The term (-43.33, -3.33, 46.67) under such an analysis is called the Advantage function A(s, a). The Advantage function expresses the relative benefits of the various actions possible in state *s*. The Q function can therefore be expressed as:

$$Q(s, a) = V(s) + A(s, a)$$

Under the optimum policy we have $A(s, a^*) = 0$, $V(s) = 100$ and therefore:

$$Q(s, a) = V(s) + A(s, a) = 100 + (-90, -50, 0) = (10, 50, 100)$$

Now the question becomes, why do we want to decompose the Q function in this way? Because there is a difference between the value of a particular state *s* and the actions proceeding from that state. Consider a game where, from a given state *s*, all actions lead to the agent dying and ending the game. This is an inherently low value state to be in, and who cares about the actions which one can take in such a state? It is pointless for the learning algorithm to waste training resources trying to find the best actions to take. In such a state, the Q values should be based solely on the value function V, and this state should be avoided. The converse case also holds: some states are just inherently valuable to be in, regardless of the effects of subsequent actions.
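The arithmetic of the three-action example can be checked with a few lines of plain Python. This is only a sketch of that worked example; the variable names are illustrative.

```python
rewards = [10.0, 50.0, 100.0]   # Q(s, a) for the three actions (no future rewards)

# random policy: every action is equally likely
v_random = sum(rewards) / len(rewards)
adv_random = [q - v_random for q in rewards]

# optimal (greedy) policy: always take the best action
v_optimal = max(rewards)
adv_optimal = [q - v_optimal for q in rewards]

print(round(v_random, 2), [round(a, 2) for a in adv_random])
print(v_optimal, adv_optimal)
```

In both cases adding the value back onto the advantages recovers the same Q values, (10, 50, 100); only the split between V and A changes with the policy.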

Consider these images taken from the original Dueling Q paper – showing the value and advantage components of the Q value in the Atari game Enduro:

In the Atari Enduro game, the goal of the agent is to pass as many cars as possible. “Running into” a car slows the agent’s car down and therefore reduces the number of cars which will be overtaken. In the images above, it can be observed that the value stream considers the road ahead and the score. However, the advantage stream does not “pay attention” to anything much when there are no cars visible. It only begins to register when there are cars close by and an action is required to avoid them. This is a good outcome, as when no cars are in view, the network should not be trying to determine which actions to take as this is a waste of training resources. This is the benefit of splitting value and advantage functions.

Now, you could argue that, because the Q function inherently contains both the value and advantage functions anyway, the neural network should learn to separate out these components regardless. Indeed, it may do. However, this comes at a cost. If the ML engineer already knows that it is important to try and separate these values, why not build them into the architecture of the network and save the learning algorithm the hassle? That is essentially what the Dueling Q network architecture does. Consider the image below showing the original architecture:

First, notice that the first part of the architecture is common, with CNN input filters and a common Flatten layer (for more on convolutional neural networks, see this tutorial). After the Flatten layer, the network bifurcates into two streams of separate densely connected layers. The first stream produces a single output corresponding to V(s). The second stream produces *n* outputs, where *n* is the number of available actions, and each of these outputs is an estimate of the advantage function for one action. These value and advantage estimates are then combined in the Aggregation layer to produce Q value estimates for each possible action in state *s*. These Q values can then be trained to approach the target Q values, generated via the Double Q mechanism i.e.:

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t); \theta^-_t)$$

The idea is that, through these separate value and advantage streams, the network will learn to produce accurate estimates of the values and advantages, improving learning performance.

What goes on in the aggregation layer? One might think we could just add the V(s) and A(s, a) values together like so:

$$Q(s, a) = V(s) + A(s, a)$$

However, there is an issue here and it’s called the problem of identifiability. This problem in the current context can be stated as follows: given Q, there is no way to uniquely identify V or A. What does this mean? Say that the network is trying to learn some optimal Q value for action *a*. Given this Q value, can we uniquely learn a V(s) and A(s, a) value? Under the formulation above, the answer is no.

Let’s say the “true” value of being in state *s* is 50 i.e. V(s) = 50. Let’s also say the “true” advantage in state *s* for action *a* is 10. This will give a Q value, Q(s, a), of 60 for this state and action. However, we can also arrive at the same Q value for a learned V(s) of, say, 0, and an advantage function A(s, a) = 60. Or alternatively, a learned V(s) of -1000 and an advantage A(s, a) of 1060. In other words, there is no way to guarantee the “true” values of V(s) and A(s, a) are being learned separately and uniquely from each other. The commonly used solution to this problem is to instead perform the following aggregation function:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|a|}\sum_{a'}A(s,a')$$

Here $|a|$ is the number of available actions, and the advantage function value is normalized with respect to the mean of the advantage function values over all actions in state *s*. Because the mean-subtracted advantages always average to zero over the actions, V(s) is pinned to the mean of the Q values in state *s* and can no longer drift by an arbitrary constant.
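The difference between the two aggregation rules can be seen in a small plain-Python sketch. The two-action state and the second advantage value (-10) are assumptions added for illustration; the function names are hypothetical.

```python
def naive_aggregate(value, advantages):
    """Q(s, a) = V(s) + A(s, a)."""
    return [value + a for a in advantages]


def dueling_aggregate(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# naive sum: very different (V, A) pairs produce identical Q values
q_naive_1 = naive_aggregate(50.0, [10.0, -10.0])
q_naive_2 = naive_aggregate(-1000.0, [1060.0, 1040.0])
print(q_naive_1, q_naive_2)

# mean-subtracted sum: a constant offset in the advantage stream has no effect on Q,
# and V(s) can always be read back as the mean of the Q values
q = dueling_aggregate(50.0, [10.0, -10.0])
q_shifted = dueling_aggregate(50.0, [1010.0, 990.0])
print(q, q_shifted, sum(q) / len(q))
```

With the mean subtracted, an offset in the advantage stream can no longer be traded against the value stream, so the value output has to carry the level of the Q values on its own.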

In TensorFlow 2.0, we can create a common “head” network, consisting of introductory layers which act to process the images or other environmental / state inputs. Then, two separate streams are created using densely connected layers which learn the value and advantage estimates, respectively. These are then combined in a special aggregation layer which calculates the equation above to finally arrive at Q values. Once the network architecture is specified in accordance with the above description, the training proceeds in the same fashion as Double Q learning. The agent actions can be selected either directly from the output of the advantage function, or from the output Q values. Because the Q values differ from the advantage values only by terms which are independent of the actions (the V(s) value and the mean advantage), the argmax-based selection of the best action will be the same regardless of whether it is extracted from the advantage or the Q values of the network.

In the next section, the implementation of a Dueling Q network in TensorFlow 2.0 will be demonstrated.

## Dueling Q network in TensorFlow 2

In this section we will be building a Dueling Q network in TensorFlow 2. However, the code will be written so that both Double Q and Dueling Q networks can be constructed with the simple change of a boolean identifier. The environment that the agent will train in is OpenAI Gym’s CartPole environment. In this environment, the agent must learn to move the cart platform back and forth in order to stop the pole from tilting too far away from vertical. While Dueling Q was originally designed for processing images, with its multiple CNN layers at the beginning of the model, in this example we will be replacing the CNN layers with simple densely connected layers. Because training reinforcement learning agents using images only (i.e. Atari RL environments) takes a long time, in this introductory post, only a simple environment is used for training the model. Future posts will detail how to efficiently train in Atari RL environments.

First of all, we declare some constants that will be used in the model, and initiate the CartPole environment:

    import datetime as dt
    import random

    import gym
    import numpy as np
    import tensorflow as tf
    from tensorflow import keras

    STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
    MAX_EPSILON = 1
    MIN_EPSILON = 0.01
    EPSILON_MIN_ITER = 5000
    DELAY_TRAINING = 300
    GAMMA = 0.95
    BATCH_SIZE = 32
    TAU = 0.08
    RANDOM_REWARD_STD = 1.0

    env = gym.make("CartPole-v0")
    state_size = 4
    num_actions = env.action_space.n

The MAX_EPSILON and MIN_EPSILON variables define the maximum and minimum values of the epsilon-greedy variable which will determine how often random actions are chosen. Over the course of the training, the epsilon-greedy parameter will decay from MAX_EPSILON gradually to MIN_EPSILON. The EPSILON_MIN_ITER value specifies how many training steps it will take before the MIN_EPSILON value is obtained. The DELAY_TRAINING constant specifies how many iterations should occur, with the memory buffer being filled, before training of the network is undertaken. The GAMMA value is the future reward discount value used in the Q-target equation, and TAU is the merging rate of the weight values between the primary network and the target network as per the Double Q learning algorithm. Finally, RANDOM_REWARD_STD is the standard deviation of the rewards that introduces some stochastic behaviour into the otherwise deterministic CartPole environment.
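To make the decay schedule concrete, here is a minimal sketch of the linear rule that the training loop later applies, pulled out into a standalone helper. The function `linear_epsilon` is hypothetical (the tutorial code computes the same expression inline), and the constants are repeated so that the snippet runs on its own.

```python
MAX_EPSILON = 1
MIN_EPSILON = 0.01
EPSILON_MIN_ITER = 5000
DELAY_TRAINING = 300


def linear_epsilon(steps):
    """Epsilon after `steps` environment steps (hypothetical helper)."""
    if steps <= DELAY_TRAINING:
        return MAX_EPSILON          # no decay while the memory is still filling
    if steps >= EPSILON_MIN_ITER:
        return MIN_EPSILON
    frac = (steps - DELAY_TRAINING) / EPSILON_MIN_ITER
    return MAX_EPSILON - frac * (MAX_EPSILON - MIN_EPSILON)


for s in (0, 300, 2650, 5000):
    print(s, round(linear_epsilon(s), 4))
```

Because the ramp is measured from DELAY_TRAINING but cut off at EPSILON_MIN_ITER, epsilon has not quite reached MIN_EPSILON when the cut-off applies, so there is a small step down at that point.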

After the definition of all these constants, the CartPole environment is created and the state size and number of actions are defined.

### Model definition

The next step in the code is to create a Keras model inherited class which defines the Double or Dueling Q network:

    class DQModel(keras.Model):
        def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
            super(DQModel, self).__init__()
            self.dueling = dueling
            self.dense1 = keras.layers.Dense(hidden_size, activation='relu',
                                             kernel_initializer=keras.initializers.he_normal())
            self.dense2 = keras.layers.Dense(hidden_size, activation='relu',
                                             kernel_initializer=keras.initializers.he_normal())
            self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                                kernel_initializer=keras.initializers.he_normal())
            self.adv_out = keras.layers.Dense(num_actions,
                                              kernel_initializer=keras.initializers.he_normal())
            if dueling:
                self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                                  kernel_initializer=keras.initializers.he_normal())
                self.v_out = keras.layers.Dense(1,
                                                kernel_initializer=keras.initializers.he_normal())
                # subtract the mean advantage over the actions (axis=1) of each sample,
                # rather than a single mean taken over the whole batch
                self.lambda_layer = keras.layers.Lambda(
                    lambda x: x - tf.reduce_mean(x, axis=1, keepdims=True))
                self.combine = keras.layers.Add()

        def call(self, inputs):
            x = self.dense1(inputs)
            x = self.dense2(x)
            adv = self.adv_dense(x)
            adv = self.adv_out(adv)
            if self.dueling:
                v = self.v_dense(x)
                v = self.v_out(v)
                norm_adv = self.lambda_layer(adv)
                combined = self.combine([v, norm_adv])
                return combined
            return adv

Let’s go through the above line by line. First, a number of parameters are passed to this model as part of its initialization. These include the size of the hidden layers of the advantage and value streams, the number of actions in the environment and finally a Boolean variable, *dueling*, to specify whether the network should be a standard Double Q network or a Dueling Q network. The first two model layers defined are simple Keras densely connected layers, *dense1* and *dense2*. These layers have ReLU activations and use the He normal weight initialization. The next two layers defined, *adv_dense* and *adv_out*, pertain to the advantage stream of the network, provided we are discussing a Dueling Q network architecture. If in fact the network is to be a Double Q network (i.e. dueling == False), then these names are a bit misleading and the layers will simply be a third densely connected layer followed by the output Q layer (*adv_out*). However, keeping with the Dueling Q terminology, the first dense layer associated with the advantage stream is simply another standard dense layer of size = hidden_size. The final layer in this stream, *adv_out*, is a dense layer with only *num_actions* outputs, and each of these outputs will learn to estimate the advantage of one of the actions in the given state (A(s, a)).

If the network is specified to be a Dueling Q network (i.e. dueling == True), then the value stream is also created. Again, a standard densely connected layer of size = hidden_size is created (*v_dense*). Then a final, single node dense layer is created to output the single value estimation (V(s)) for the given state. These layers specify the advantage and value streams respectively. Now the aggregation layer is to be created. This aggregation layer is created by using two Keras layers: a Lambda layer and an Add layer. The Lambda layer allows the developer to specify some user-defined operation to perform on the inputs to the layer. In this case, we want the layer to calculate the following:

$$A(s,a) - \frac{1}{|a|}\sum_{a'}A(s,a')$$

This is calculated by using a *lambda x: x - tf.reduce_mean(...)* expression in the Lambda layer. Note that the mean must be taken over the action axis of each sample (i.e. *tf.reduce_mean(x, axis=1, keepdims=True)*); calling *tf.reduce_mean(x)* with no axis would instead average over every entry in the batch. Finally, we need a simple Keras addition layer to add this mean-normalized advantage function to the value estimation.
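The importance of the axis can be seen in a small plain-Python sketch, in which nested lists stand in for a batch of advantage outputs (one row per state, one column per action). The numbers and function names are hypothetical.

```python
def mean_normalise_per_state(advantages):
    """Subtract each row's own mean: the normalisation in the aggregation formula."""
    return [[a - sum(row) / len(row) for a in row] for row in advantages]


def mean_normalise_whole_batch(advantages):
    """Subtract one mean computed over every entry in the batch (not what the formula asks for)."""
    flat = [a for row in advantages for a in row]
    batch_mean = sum(flat) / len(flat)
    return [[a - batch_mean for a in row] for row in advantages]


batch = [[1.0, 3.0], [10.0, 20.0]]   # hypothetical advantages for two states
per_state = mean_normalise_per_state(batch)
whole_batch = mean_normalise_whole_batch(batch)
print(per_state)     # every row averages to zero
print(whole_batch)   # rows are shifted by a mean that depends on the other states
```

With a batch of one state the two versions coincide, which is why the difference only shows up during batched training and not when choosing a single action.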

This completes the explanation of the layer definitions in the model. The *call* method in this model definition then applies these various layers to the state inputs of the model. The following two lines execute the Dueling Q aggregation function:

    norm_adv = self.lambda_layer(adv)
    combined = self.combine([v, norm_adv])

Note that first the mean-normalizing lambda function is applied to the output from the advantage stream. This normalized advantage is then added to the value stream output to produce the final Q values (*combined*). Now that the model class has been defined, it is time to instantiate two models – one for the primary network and the other for the target network:

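    primary_network = DQModel(30, num_actions, True)
    target_network = DQModel(30, num_actions, True)
    primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
    # a subclassed Keras model only creates its weights on its first call, so
    # pass a dummy state through both networks before copying any variables
    dummy_state = np.zeros((1, state_size), dtype=np.float32)
    primary_network(dummy_state)
    target_network(dummy_state)
    # make target_network = primary_network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(e)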

After the *primary_network* and *target_network* have been created, only the *primary_network* is compiled as only the primary network is actually trained using the optimization function. As per Double Q learning, the target network is instead moved slowly “towards” the primary network by the gradual merging of weight values. Initially however, the target network trainable weights are set to be equal to the primary network trainable variables, using the TensorFlow assign function.

### Other functions

The next function to discuss is the target network updating which is performed during training. In Double Q network training, there are two options for transitioning the target network weights towards the primary network weights. The first is to perform a wholesale copy of the weights every N training steps. Alternatively, the weights can be moved towards the primary network gradually every training iteration as follows:

    def update_network(primary_network, target_network):
        # update target network parameters slowly from primary network
        for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
            t.assign(t * (1 - TAU) + e * TAU)
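To get a feel for how quickly the target weights track the primary weights under this rule, here is a minimal plain-Python sketch. Lists of floats stand in for the network variables, the primary weights are held fixed purely for illustration, and `soft_update` is a hypothetical helper.

```python
TAU = 0.08


def soft_update(target_weights, primary_weights, tau=TAU):
    """Return target weights moved a fraction `tau` of the way towards the primary weights."""
    return [t * (1 - tau) + p * tau for t, p in zip(target_weights, primary_weights)]


target = [0.0, 0.0]
primary = [1.0, -2.0]     # held fixed here; in training these change as well
one_step = soft_update(target, primary)
for step in range(50):
    target = soft_update(target, primary)
print(one_step, [round(w, 3) for w in target])
```

Each update shrinks the remaining gap by a factor of (1 - TAU), so with TAU = 0.08 the target weights lag the primary weights by a few dozen training steps.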

As can be observed, the new target weight variables are a weighted average between the current weight values and the primary network weights – with the weighting factor equal to TAU. The next code snippet is the definition of the memory class:

    class Memory:
        def __init__(self, max_memory):
            self._max_memory = max_memory
            self._samples = []

        def add_sample(self, sample):
            self._samples.append(sample)
            if len(self._samples) > self._max_memory:
                self._samples.pop(0)

        def sample(self, no_samples):
            if no_samples > len(self._samples):
                return random.sample(self._samples, len(self._samples))
            else:
                return random.sample(self._samples, no_samples)

        @property
        def num_samples(self):
            return len(self._samples)


    memory = Memory(500000)

This class takes tuples of (state, action, reward, next state) values and appends them to a memory list, which is randomly sampled from when required during training. The next function defines the epsilon-greedy action selection policy:

    def choose_action(state, primary_network, eps):
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(state.reshape(1, -1)))

If a random number sampled from the interval between 0 and 1 falls below the current epsilon value, a random action is selected. Otherwise, the current state is passed to the primary model, from which the Q values for each action are returned. The action with the highest Q value, selected by the numpy argmax function, is returned.
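The behaviour at the two extremes of epsilon can be checked with a network-free sketch. Here a plain list of hypothetical Q values replaces the call to the primary network, and `epsilon_greedy` is an illustrative stand-in for the function above.

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randint(0, len(q_values) - 1)
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5]   # hypothetical Q values for the two CartPole actions
greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(1000)]
explore = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]
print(set(greedy), sum(explore) / len(explore))
```

With eps = 0 the greedy action (index 1) is always returned, while with eps = 1 the two actions are chosen roughly equally often regardless of their Q values.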

The next function is the *train* function, where the training of the primary network takes place:

    def train(primary_network, memory, target_network):
        batch = memory.sample(BATCH_SIZE)
        states = np.array([val[0] for val in batch])
        actions = np.array([val[1] for val in batch])
        rewards = np.array([val[2] for val in batch])
        next_states = np.array([(np.zeros(state_size)
                                 if val[3] is None else val[3]) for val in batch])
        # predict Q(s,a) given the batch of states
        prim_qt = primary_network(states)
        # predict Q(s',a') from the evaluation network
        prim_qtp1 = primary_network(next_states)
        # copy the prim_qt tensor into the target_q tensor - we then will update one index
        # corresponding to the max action
        target_q = prim_qt.numpy()
        updates = rewards
        valid_idxs = np.array(next_states).sum(axis=1) != 0
        batch_idxs = np.arange(BATCH_SIZE)
        # extract the best action from the next state
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        # get all the q values for the next state
        q_from_target = target_network(next_states)
        # add the discounted estimated reward from the selected action (prim_action_tp1)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs],
                                                             prim_action_tp1[valid_idxs]]
        # update the q target to train towards
        target_q[batch_idxs, actions] = updates
        # run a training batch
        loss = primary_network.train_on_batch(states, target_q)
        return loss

For a more detailed explanation of this function, see my Double Q tutorial. However, the basic operations that are performed are expressed in the following formulas:

$$a^* = \arg\max_a Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

The best action from the next state, $a^*$, is selected from the primary network (weights = $\theta_t$). However, the Q value for this action in the next state ($s_{t+1}$) is extracted from the target network (weights = $\theta^-_t$). A Keras train_on_batch operation is performed by passing a batch of states and subsequent target Q values, and the loss is finally returned from this function.
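One detail of the *train* function that is easy to miss is that the target array starts as a copy of the network’s own predictions, and only the entry for the action actually taken is overwritten. The following plain-Python sketch illustrates that bookkeeping on a batch of two transitions. Nested lists stand in for the network outputs, and the function name and all numbers are hypothetical.

```python
GAMMA = 0.95


def build_targets(q_current, actions, rewards, q_next_primary, q_next_target, terminal):
    """Copy q_current and replace only the taken action's entry with the Double Q target."""
    targets = [list(row) for row in q_current]
    for i, action in enumerate(actions):
        update = rewards[i]
        if not terminal[i]:
            # primary network picks the next action, target network scores it
            best = max(range(len(q_next_primary[i])), key=lambda a: q_next_primary[i][a])
            update += GAMMA * q_next_target[i][best]
        targets[i][action] = update
    return targets


targets = build_targets(
    q_current=[[0.5, 0.7], [0.1, 0.3]],
    actions=[0, 1],
    rewards=[1.0, 1.0],
    q_next_primary=[[2.0, 1.0], [0.0, 0.0]],
    q_next_target=[[1.5, 9.0], [0.0, 0.0]],
    terminal=[False, True],
)
print(targets)
```

The entries for actions that were not taken are identical to the current predictions, so under a mean squared error loss they contribute nothing, and the second (terminal) transition receives only its immediate reward.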

### The main Dueling Q training loop

The main training loop which trains our Dueling Q network is shown below:

    num_episodes = 1000000
    eps = MAX_EPSILON
    render = False
    train_writer = tf.summary.create_file_writer(
        STORE_PATH + f"/DuelingQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
    steps = 0
    for i in range(num_episodes):
        cnt = 1
        avg_loss = 0
        tot_reward = 0
        # note: written against the older Gym API, where reset() returns only the
        # state and step() returns four values
        state = env.reset()
        while True:
            if render:
                env.render()
            action = choose_action(state, primary_network, eps)
            next_state, _, done, info = env.step(action)
            # replace the deterministic reward of 1 with a noisy one
            reward = np.random.normal(1.0, RANDOM_REWARD_STD)
            tot_reward += reward
            if done:
                next_state = None
            # store in memory
            memory.add_sample((state, action, reward, next_state))

            if steps > DELAY_TRAINING:
                loss = train(primary_network, memory, target_network)
                update_network(primary_network, target_network)
            else:
                loss = -1
            avg_loss += loss

            # linearly decay the eps value
            if steps > DELAY_TRAINING:
                eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                      (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                      MIN_EPSILON
            steps += 1

            if done:
                if steps > DELAY_TRAINING:
                    avg_loss /= cnt
                    print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                    with train_writer.as_default():
                        tf.summary.scalar('reward', cnt, step=i)
                        tf.summary.scalar('avg loss', avg_loss, step=i)
                else:
                    print(f"Pre-training...Episode: {i}")
                break

            state = next_state
            cnt += 1

Again, this training loop has been explained in detail in the Double Q tutorial. However, some salient points are worth highlighting. First, Double and Dueling Q networks are superior to vanilla Deep Q networks especially in the cases where there is some stochastic component to the environment. As the CartPole environment is deterministic, some stochasticity is added in the reward. Normally, every time step in the episode results in a reward of 1 i.e. the CartPole has survived another time step – good job. However, in this case I’ve added a reward which is sampled from a normal distribution with a mean of 1.0 but a standard deviation of RANDOM_REWARD_STD. This adds the requisite uncertainty which makes Double and Dueling Q networks clearly superior to Deep Q networks – see my Double Q tutorial for a demonstration of this.

Another point to highlight is that training of the primary (and by extension target) network does not begin until DELAY_TRAINING steps have been exceeded. Also, the epsilon value for the epsilon-greedy action selection policy doesn’t decay until these DELAY_TRAINING steps have been exceeded.

### Dueling Q vs Double Q results

A comparison of the training progress with respect to the deterministic reward of the agent in the CartPole environment under Double Q and Dueling Q architectures can be observed in the figure below, with the x-axis being the number of episodes:

As can be observed, there is a slightly higher performance of the Double Q network with respect to the Dueling Q network. However, the performance difference is fairly marginal, and may be within the variation arising from the random weight initialization of the networks. There is also the issue of the Dueling Q network being slightly more complicated due to the additional value stream. As such, on a fairly simple environment like the CartPole environment, the benefits of Dueling Q over Double Q may not be realized. However, in more complex environments like Atari environments, it is likely that the Dueling Q architecture will be superior to Double Q (this is what the original Dueling Q paper has shown). Future posts will demonstrate the Dueling Q architecture in Atari environments.