# Reinforcement learning tutorial with TensorFlow

Reinforcement learning has gained significant attention with the relatively recent success of DeepMind’s AlphaGo system defeating the world champion Go player. The AlphaGo system was trained in part by reinforcement learning on deep neural networks. This type of learning is a different aspect of machine learning from the classical supervised and unsupervised paradigms. In reinforcement learning using deep neural networks, the network reacts to environmental data (called the state) and controls the actions of an agent to attempt to maximize a reward. This process allows a network to learn to play games, such as Atari or other video games, or any other problem that can be recast as some form of game. In this tutorial, I’ll introduce the broad concepts of Q learning, a popular reinforcement learning paradigm, and I’ll show how to implement deep Q learning in TensorFlow. If you need to get up to speed in TensorFlow, check out my introductory tutorial.

Recommended online course – If you are more of a video learner, check out this inexpensive online course: Advanced AI: Deep Reinforcement Learning in Python

# Introduction to reinforcement learning

As stated above, reinforcement learning comprises a few fundamental entities or concepts: an environment, which produces a state and reward, and an agent, which performs actions in the given environment. This interaction can be seen in the diagram below:

The goal of the agent in such an environment is to examine the state and the reward information it receives, and choose an action which maximizes the reward feedback it receives.  The agent learns by repeated interaction with the environment, or, in other words, repeated playing of the game.

To be successful, the agent needs to:

1. Learn the interaction between states, actions and subsequent rewards
2. Determine which is the best action to choose given (1)

The implementation of (1) involves determining some set of values which can be used to inform (2), and (2) is called the action policy. One of the most common ways of implementing (1) and (2) using deep learning is via the Deep Q network and the epsilon-greedy policy. I’ll cover both of these concepts in the next two sections.

## Q learning

Q learning is a value-based method of supplying information to inform which action an agent should take. An intuitive first idea for creating values upon which to base actions is to build a table which sums up the rewards of taking action a in state s over multiple game plays. This table could keep track of which moves are the most advantageous. For instance, let’s consider a simple game which has 3 states and two possible actions in each state – the rewards for this game can be represented in a table:

In the table above, you can see that for this simple game, when the agent is in State 1 and takes Action 2, it will receive a reward of 10, but zero reward if it takes Action 1. In State 2 the situation is reversed, and State 3 resembles State 1. If an agent randomly explored this game and summed up which actions received the most reward in each of the three states (storing this information in an array, say), then it would essentially learn the functional form of the table above.
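To make this concrete, here is a minimal sketch in plain Python of learning such a table by random play and summation. The reward values are taken from the table above; the state sampling and number of plays are assumptions for illustration only:

```python
import random

# rewards for the simple 3-state, 2-action game in the table above:
# State 1: Action 1 -> 0, Action 2 -> 10; State 2 is reversed; State 3 is like State 1
REWARDS = {1: [0, 10], 2: [10, 0], 3: [0, 10]}

# sum up the reward received for each (state, action) pair over random play
totals = {s: [0, 0] for s in REWARDS}
for _ in range(1000):
    s = random.choice(list(REWARDS))
    a = random.choice([0, 1])
    totals[s][a] += REWARDS[s][a]

# the "learnt" policy: in each state, pick the action with the larger summed reward
best_action = {s: totals[s].index(max(totals[s])) for s in totals}
# best_action -> {1: 1, 2: 0, 3: 1}, i.e. Action 2, Action 1, Action 2
```

After enough random plays, the summed table reproduces the reward structure exactly, which is all an agent needs for this simple game.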

In other words, if the agent simply chooses the action which it learnt had yielded the highest reward in the past (effectively learning some form of the table above) it would have learnt how to play the game successfully. Why do we need fancy concepts such as Q learning and neural networks then, when simply creating tables by summation is sufficient?

### Deferred reward

Well, the first obvious answer is that the game above is clearly very simple, with only 3 states and 2 actions per state. Real games are significantly more complex. The other significant concept that is missing in the example above is the idea of deferred reward. To adequately play most realistic games, an agent needs to learn to be able to take actions which may not immediately lead to a reward, but may result in a large reward further down the track.

Consider another game, defined by the table below:

In the game defined above, in all states, if Action 2 is taken, the agent moves back to State 1 – i.e. it goes back to the beginning. In States 1 to 3, it also receives a reward of 5 when it does so. However, in States 1 to 3, if Action 1 is taken, the agent moves forward to the next state, but doesn’t receive a reward until it reaches State 4 – at which point it receives a reward of 20. In other words, the agent is better off forgoing the instantaneous reward of 5 from Action 2, and instead choosing Action 1 consistently to progress through the states and collect the reward of 20. The agent needs to be able to select actions which result in a delayed reward, as long as the delayed reward value is sufficiently large.
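This trade-off can be quantified with a discount factor $\gamma$ (formally introduced in the next section). As a rough, one-shot sketch from State 1 – ignoring what happens after each choice, and with an assumed $\gamma = 0.95$ – compare taking the instant 5 against stepping forward through the zero-reward states to collect the 20:

```python
gamma = 0.95  # assumed discount factor

# from State 1: Action 2 pays 5 immediately...
immediate = 5.0
# ...while Action 1 three times pays 0, 0 and then 20, each step discounted by gamma
delayed = 0 + gamma * 0 + gamma ** 2 * 20.0
# delayed is about 18.05 > 5, so the patient route wins under this discount
```

The smaller $\gamma$ becomes, the more heavily the 20 is discounted, so the discount factor controls exactly how "patient" the agent is willing to be.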

### The Q learning rule

This allows us to define the Q learning rule. In deep Q learning, the neural network needs to take the current state, s, as a variable and return a Q value for each possible action, a, in that state – i.e. it needs to return $Q(s,a)$ for all s and a. This $Q(s,a)$ needs to be updated in training via the following rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s,a)]$$

This updating rule needs a bit of unpacking. First, you can see that the new value of $Q(s,a)$ involves updating its current value by adding on some extra terms from the right hand side of the equation above. Moving left to right, ignore the $\alpha$ for a bit. Inside the square brackets, the first term is r, which stands for the reward that is received for taking action a in state s. This is the immediate reward; no delayed gratification is involved yet.

The next term is the delayed reward calculation. First, we have the $\gamma$ value which discounts the delayed reward impact – it is always between 0 and 1. More on that in a second. The next term $\max_{a'} Q(s', a')$ is the maximum Q value possible in the next state. Let’s make that a bit clearer – the agent starts in state s, takes action a, ends up in state s', and then the code determines the maximum Q value in state s', i.e. $\max_{a'} Q(s', a')$.

So why is the value $\max_{a'} Q(s', a')$ considered? It is considered because it represents the maximum future reward coming to the agent if it takes action a in state s. However, this value is discounted by $\gamma$ to take into account that it isn’t ideal for the agent to wait forever for a future reward – it is best for the agent to aim for the maximum reward in the least period of time. Note that the value $Q(s',a')$ implicitly also holds the maximum discounted reward for the state after that, i.e. $Q(s'', a'')$, and likewise it holds the discounted reward for the state after that, $Q(s''', a''')$, and so on. This is how the agent can choose its action based on not just the immediate reward r, but also on possible future discounted rewards.

The final components in the formula above are the $\alpha$ value, which is the learning rate during the updating, and the current value $Q(s,a)$, which is subtracted inside the square brackets. This subtraction makes the update incremental – only the difference between the target and the current estimate is applied. Neither $\alpha$ nor the $Q(s,a)$ subtraction needs to be explicitly defined in deep Q learning, as the neural network's optimizer takes care of them during its learning process. This process will be discussed in the next section.
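A single application of the rule can be traced through with concrete numbers. All of the values below ($\alpha$, $\gamma$, the reward and the current estimates) are assumed, purely for illustration:

```python
alpha, gamma = 0.1, 0.95    # learning rate and discount factor (assumed values)
q_sa = 0.0                  # current estimate of Q(s, a)
r = 10.0                    # immediate reward for taking a in s
max_q_next = 4.0            # max over a' of Q(s', a') from the current estimates

# the target is the immediate reward plus the discounted best future value
target = r + gamma * max_q_next          # 10 + 0.95 * 4 = 13.8
# move the estimate a fraction alpha of the way toward the target
q_sa = q_sa + alpha * (target - q_sa)    # 0 + 0.1 * 13.8 = 1.38
```

Repeated over many visits to (s, a), the estimate converges toward the (moving) target rather than jumping to it, which is what the learning rate $\alpha$ controls.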

### Deep Q learning

Deep Q learning applies the Q learning updating rule during the training process. In other words, a neural network is created which takes the state as its input, and then the network is trained to output appropriate Q(s,a) values for each action in state s. The action of the agent can then be chosen by taking the action with the greatest Q(s,a) value (by taking an argmax of the output of the neural network). This can be seen in the first step of the diagram below:

Once this step has been taken and an action has been selected, the agent can perform that action. The agent will then receive feedback on what reward is received by taking that action from that state. Now, the next step that we want to perform is to train the network according to the Q learning rule. This can be seen in the second part of the diagram above. The x input array for training the network is the state vector s, and the y output training sample is the Q(s,a) vector retrieved during the action selection step. However, the Q(s,a) value corresponding to action a is set to a target of $r + \gamma \max_{a'} Q(s', a')$ – this can be observed in the figure above.

By training the network in this way, the Q(s,a) output vector from the network will over time become better at informing the agent what action will be the best to select for its long term gain. There is a bit more to the story about action selection, however, which will be discussed in the next section.

## The epsilon-greedy policy

In the explanation above, the action selection policy was simply the action which corresponded to the highest Q output from the neural network. However, this policy isn’t the most effective. Why is that? It is because, when the neural network is randomly initialized, it will be predisposed to select certain sub-optimal actions randomly. This may cause the agent to fall into sub-optimal behavior patterns without thoroughly exploring the game and action / reward space. As such, the agent won’t find the best strategies to play the game.

It is useful here to introduce two concepts – exploration and exploitation. At the beginning of an optimization problem, it is best to allow the problem space to be explored extensively in the hope of finding good local (or even global) minima. However, once the problem space has been adequately searched, it is now best for the optimization algorithm to focus on exploiting what it has found by converging on the best minima to arrive at a good solution.

Therefore, in reinforcement learning, it is best to allow some randomness in the action selection at the beginning of the training. This randomness is determined by the epsilon parameter. Essentially, a random number is drawn between 0 and 1, and if it is less than epsilon, then a random action is selected. If not, an action is selected based on the output of the neural network. The epsilon variable usually starts somewhere close to 1, and is slowly decayed to somewhere around 0 during training. This allows a large exploration of the game at the beginning, but then the decay of the epsilon value allows the network to zero in on a good solution.
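A minimal sketch of this policy and decay schedule follows. The bounds and decay constant here are assumed example values, not prescribed ones:

```python
import math
import random

MAX_EPSILON, MIN_EPSILON = 1.0, 0.01   # assumed start and floor values
LAMBDA = 0.001                         # assumed decay constant

def epsilon_at(step):
    # exponential decay from MAX_EPSILON down toward MIN_EPSILON
    return MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * step)

def choose_action(q_values, eps):
    # explore with probability eps, otherwise exploit the best known action
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With eps = 0 the policy is purely greedy; at step 0 it is almost entirely random, and the exponential decay moves it smoothly from one regime to the other.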

We’re almost at the point where we can check out the game that will be used in this example, and begin to build our deep Q network. However, there is just one final important point to consider.

## Batching in reinforcement learning

If a deep Q network is trained at each step in the game i.e. after each action is performed and the reward collected, there is a strong risk of over-fitting in the network. This is because game play is highly correlated i.e. if the game starts from the same place and the agent performs the same actions, there will likely be similar results each time (not exactly the same though, because of randomness in some games). Therefore, after each action it is a good idea to add all the data about the state, reward, action and the new state into some sort of memory. This memory can then be randomly sampled in batches to avoid the risk of over-fitting.

The network can therefore still be trained after each step if you desire (or less frequently, it’s up to the developer), but it is extracting the training data not from the agent’s ordered steps through the game, but rather a randomized memory of previous steps and outcomes that the agent has experienced. You’ll be able to see how this works in the code below.

We are now ready to examine the game/environment that we will develop our network to learn.

# The Mountain Car Environment and Open AI Gym

In this reinforcement learning tutorial, the deep Q network that will be created will be trained on the Mountain Car environment/game. This can be accessed through the open source reinforcement learning library called Open AI Gym. A screen capture from the rendered game can be observed below:

The object of this game is to get the car to go up the right-side hill to get to the flag. There’s one problem however, the car doesn’t have enough power to motor all the way up the hill. Instead, the car / agent needs to learn that it must motor up one hill for a bit, then accelerate down the hill and back up the other side, and repeat until it builds up enough momentum to make it to the top of the hill.

As stated above, Open AI Gym is an open source reinforcement learning package that allows developers to interact easily with games such as the Mountain Car environment. You can find details about the Mountain Car environment here. Basically, the environment is represented by a two-element state vector, detailed below:

As can be observed, the agent’s state is represented by the car’s position and velocity. The goal/flag is sitting at a position = 0.5. The actions available to the agent are shown below:

As can be observed, there are three actions available to the agent – accelerate to the left, accelerate to the right, or apply no acceleration.

In the game’s default arrangement, for each time step where the car’s position is <0.5, it receives a reward of -1, up to a maximum of 200 time steps. So the incentive for the agent is to get the car’s position to >0.5 as soon as possible, after which the game ends. This will minimize the negative reward, which is the aim of the game.

However, in this default arrangement, it will take a significant period of time of random exploration before the car stumbles across the positive feedback of getting to the flag. As such, to speed things up a bit, in this example we’ll alter the reward structure to:

• Position > 0.1: r += 10
• Position > 0.25: r += 20
• Position > 0.5: r += 100

This new reward structure gives the agent better positive feedback when it starts learning how to ascend the hill on the right hand side toward the flag. The position of 0.1 is just over half way up the right-hand hill.
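This shaping can be sketched as a small helper function. Note that the thresholds must be tested from largest to smallest, so that each step earns only its single largest bonus (the threshold and bonus values are those listed above):

```python
def shaped_reward(base_reward, position):
    # test the largest threshold first so each step gets at most one bonus
    if position >= 0.5:
        return base_reward + 100
    elif position >= 0.25:
        return base_reward + 20
    elif position >= 0.1:
        return base_reward + 10
    return base_reward
```

For example, with the default per-step reward of -1, a position of 0.3 yields -1 + 20 = 19, while reaching the flag at 0.5 or beyond yields -1 + 100 = 99.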

Ok, so now you know the environment, let’s write some code!

# Reinforcement learning in TensorFlow

In this reinforcement learning implementation in TensorFlow, I’m going to split the code up into three main classes:

• Model: This class holds the TensorFlow operations and model definitions
• Memory: This class is where the memory of the actions, rewards and states are stored and retrieved from
• GameRunner: This class is the main training and agent control class

As stated before, I’ll be assuming some prior knowledge of TensorFlow here. If you’re not up to speed, you’re welcome to wing it – otherwise, check out my TensorFlow tutorial. All the code for this tutorial can be found on this site’s Github repository.

I’ll go through each of the classes in turn in the sub-sections below.

## The Model class

```python
class Model:
    def __init__(self, num_states, num_actions, batch_size):
        self._num_states = num_states
        self._num_actions = num_actions
        self._batch_size = batch_size
        # define the placeholders
        self._states = None
        self._q_s_a = None
        # the output operations
        self._logits = None
        self._optimizer = None
        self._var_init = None
        # now setup the model
        self._define_model()

    def _define_model(self):
        self._states = tf.placeholder(shape=[None, self._num_states], dtype=tf.float32)
        self._q_s_a = tf.placeholder(shape=[None, self._num_actions], dtype=tf.float32)
        # create a couple of fully connected hidden layers
        fc1 = tf.layers.dense(self._states, 50, activation=tf.nn.relu)
        fc2 = tf.layers.dense(fc1, 50, activation=tf.nn.relu)
        self._logits = tf.layers.dense(fc2, self._num_actions)
        loss = tf.losses.mean_squared_error(self._q_s_a, self._logits)
        self._optimizer = tf.train.AdamOptimizer().minimize(loss)
        self._var_init = tf.global_variables_initializer()
```

The first function within the class is of course the initialization function. All you need to pass into the Model definition is the number of states of the environment (2 in this game), the number of possible actions (3 in this game) and the batch size. The function simply sets up a few internal variables and operations, some of which are exposed as public properties later in the class definition. At the end of the initialization, the second method displayed above, _define_model(), is called. This method sets up the model structure and the main operations.

First, two placeholders are created _states and _q_s_a – these hold the state data and the $Q(s,a)$ training data respectively. The first dimension of these placeholders is set to None, so that it will automatically adapt when a batch of training data is fed into the model and also when single predictions from the model are required. The next lines create two fully connected layers fc1 and fc2 using the handy TensorFlow layers module. These hidden layers have 50 nodes each, and they are activated using the ReLU activation function (if you want to know more about the ReLU, check out my vanishing gradient and ReLU tutorial).

The next layer is the output layer _logits – this is another fully connected or dense layer, but with no activation supplied. When no activation function is supplied to the dense layer API in TensorFlow, it defaults to a ‘linear’ activation i.e. no activation. This is what we want, as we want the network to learn continuous $Q(s,a)$ values across all possible real numbers.

Next comes the loss – since this isn’t a classification problem, a good loss to use is simply the mean squared error. The next line specifies the optimizer – in this example, we’ll just use the generic Adam optimizer. Finally, the TensorFlow boilerplate global variable initializer operation is assigned to _var_init.

So far so good. Next, some methods of the Model class are created to perform prediction and training:

```python
    def predict_one(self, state, sess):
        return sess.run(self._logits, feed_dict={self._states:
                                                 state.reshape(1, self.num_states)})

    def predict_batch(self, states, sess):
        return sess.run(self._logits, feed_dict={self._states: states})

    def train_batch(self, sess, x_batch, y_batch):
        sess.run(self._optimizer, feed_dict={self._states: x_batch, self._q_s_a: y_batch})
```

The first method predict_one simply returns the output of the network (i.e. by calling the _logits operation) with an input of a single state. Note the reshaping operation that is used to ensure that the data has a size (1, num_states). This is called whenever action selection by the agent is required. The next method, predict_batch, predicts a whole batch of outputs when given a batch of input states – this is used to perform batch evaluation of $Q(s,a)$ and $Q(s',a')$ values for training. Finally, there is a method called train_batch which runs a training step of the network on a batch of data.

That’s the Model class, now it is time to consider the Memory class.

## The Memory class

The next class to consider in the code is the Memory class – this class stores all the results of the agent’s actions in the game, and also handles their retrieval. These results can then be used to batch train the network.

```python
class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)
```

First, when the Memory class is initialized, it is necessary to supply a maximum memory argument – this will control the maximum number of (state, action, reward, next_state) tuples the _samples list can hold. The bigger the better, as it ensures better random mixing of the samples, but you have to make sure you don’t run into memory errors.

The first method, add_sample takes an individual (state, action, reward, next_state) tuple and appends it to the _samples list. After this, a check is made – if the number of samples is now larger than the allowable memory size, the first element in _samples is removed using the Python .pop() list functionality.

The final method, sample returns a random selection of no_samples in length. However, if the no_samples argument is larger than the actual memory, whatever is available in the memory is returned.
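The eviction-and-sampling behaviour can be illustrated standalone, outside the class (the list values below are arbitrary, chosen only to make the result easy to follow):

```python
import random

samples, max_memory = [], 3
for i in range(5):
    samples.append(i)
    if len(samples) > max_memory:
        samples.pop(0)   # evict the oldest sample, FIFO style
# samples is now [2, 3, 4]

# asking for more samples than exist just returns everything available
batch = random.sample(samples, min(4, len(samples)))
```

One design note: list.pop(0) is O(n) in the list length; for very large memories a collections.deque with maxlen set would perform the same FIFO eviction in O(1).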

The final class is called GameRunner.

## The GameRunner class

The GameRunner class in this example is where all the model dynamics, agent actions and training are organised.

```python
class GameRunner:
    def __init__(self, sess, model, env, memory, max_eps, min_eps,
                 decay, render=True):
        self._sess = sess
        self._env = env
        self._model = model
        self._memory = memory
        self._render = render
        self._max_eps = max_eps
        self._min_eps = min_eps
        self._decay = decay
        self._eps = self._max_eps
        self._steps = 0
        self._reward_store = []
        self._max_x_store = []
```

In the GameRunner initialization, some internal variables are created. Note, it takes as first argument a TensorFlow session object, then a neural network Model, an Open AI gym environment and a Memory class instance. The next arguments max_eps and min_eps dictate the maximum and minimum epsilon values respectively – during training the actual $\epsilon$ will decay from the maximum to the minimum based on the following argument decay. Finally, render is a boolean which determines whether the game environment is rendered to the screen.

The next method is run():

```python
    def run(self):
        state = self._env.reset()
        tot_reward = 0
        max_x = -100
        while True:
            if self._render:
                self._env.render()

            action = self._choose_action(state)
            next_state, reward, done, info = self._env.step(action)
            # check the largest position threshold first, so that only
            # the single largest bonus applies at each step
            if next_state[0] >= 0.5:
                reward += 100
            elif next_state[0] >= 0.25:
                reward += 20
            elif next_state[0] >= 0.1:
                reward += 10

            if next_state[0] > max_x:
                max_x = next_state[0]
            # is the game complete? If so, set the next state to
            # None for storage sake
            if done:
                next_state = None

            self._memory.add_sample((state, action, reward, next_state))
            self._replay()

            # exponentially decay the eps value
            self._steps += 1
            self._eps = self._min_eps + (self._max_eps - self._min_eps) \
                        * math.exp(-self._decay * self._steps)

            # move the agent to the next state and accumulate the reward
            state = next_state
            tot_reward += reward

            # if the game is done, break the loop
            if done:
                self._reward_store.append(tot_reward)
                self._max_x_store.append(max_x)
                break

        print("Step {}, Total reward: {}, Eps: {}".format(self._steps, tot_reward, self._eps))
```

We’ll go through each step in the code above. First, the environment is reset by calling the Open AI Gym command .reset(). Then an infinite loop is entered into – this will be exited by calling a break command. If the boolean _render is True, then the output of the game will be shown on the screen. The action of the agent is determined by calling the internal method _choose_action(state) – this will be discussed later. Next, the agent takes the action by calling the Open AI Gym command step(action). This command returns a tuple containing the new state of the agent, the reward received by taking the action, a done boolean indicating whether the game has finished, and an information object (we won’t be using info in this example).

The next step in the code is where there are some manual adjustments to the Mountain Car reward system. If you recall, earlier I mentioned that in order to speed up the training of the network, it was useful to add some more reward steps the closer the car got to the goal (rather than the default reward which was only received when the car reached the goal/flag). The maximum x value achieved in the given episode is also tracked and this will be stored once the game is complete.

The next step is a check to see if the game has completed i.e. done == True – this will occur when the car reaches the flag, or otherwise after 200 time steps. If it has completed, we want to set the next_state to None. This will be picked up during the training / replay step of the class, and the state will be set to an array of zeros whenever next_state is equal to None.

After this, the data about the agent is stored in the Memory class – i.e. its original state, its chosen action, the reward it received for that action and finally the next_state of the agent. After this takes place, the training / replay step of the deep Q network is run – this step will be discussed more below. At this point the epsilon value is also exponentially decayed. Finally, the agent’s state is moved to next_state, the total reward during the game is accumulated, and there is some printing and breaking of the loop and storing of relevant variables if the game is complete.

The next part of the GameRunner class is the agent action selection method:

```python
    def _choose_action(self, state):
        if random.random() < self._eps:
            return random.randint(0, self._model.num_actions - 1)
        else:
            return np.argmax(self._model.predict_one(state, self._sess))
```

This method executes our epsilon greedy + Q policy. In the first case, if a random number is less than the _eps value, then the returned action will simply be an action chosen at random from the set of possible actions. Otherwise, the action will be chosen based on an argmax of the output from the neural network. Recall that predict_one from the model will take a single state as input, then output $Q(s,a)$ values for each of the possible actions available – the action with the highest $Q(s,a)$ value is the action with the highest expected current + future discounted reward.

The final method within the GameRunner class is the _replay method, where the batching and training takes place:

```python
    def _replay(self):
        batch = self._memory.sample(self._model.batch_size)
        states = np.array([val[0] for val in batch])
        next_states = np.array([(np.zeros(self._model.num_states)
                                 if val[3] is None else val[3]) for val in batch])
        # predict Q(s,a) given the batch of states
        q_s_a = self._model.predict_batch(states, self._sess)
        # predict Q(s',a') - so that we can do gamma * max(Q(s'a')) below
        q_s_a_d = self._model.predict_batch(next_states, self._sess)
        # setup training arrays
        x = np.zeros((len(batch), self._model.num_states))
        y = np.zeros((len(batch), self._model.num_actions))
        for i, b in enumerate(batch):
            state, action, reward, next_state = b[0], b[1], b[2], b[3]
            # get the current q values for all actions in state
            current_q = q_s_a[i]
            # update the q value for action
            if next_state is None:
                # in this case, the game completed after action, so there is
                # no max Q(s',a') prediction possible
                current_q[action] = reward
            else:
                current_q[action] = reward + GAMMA * np.amax(q_s_a_d[i])
            x[i] = state
            y[i] = current_q
        self._model.train_batch(self._sess, x, y)
```

The first step in the _replay method is to retrieve a randomized batch of data from memory. Next, we want to setup our batch state variables so that we can:

1. For each state, produce baseline $Q(s,a)$ values – one of which will be given a target of $r + \gamma \max_{a'} Q(s', a')$
2. For each next_state, predict $Q(s',a')$ from the model, as required in (1)

Now, if you recall, each sample in memory has the form of a tuple: state, action, reward, next_state which was extracted from the game play. To setup a batch of initial states, then, we simply use Python list comprehension to extract the first tuple value from each sample in the batch. Likewise, we do the same for the fourth value in the tuple to extract the next_state value for each sample in the batch. Note that whenever the next_state corresponds to a case where the game finished (i.e. next_state is None) the next state value is replaced by a vector of zeros corresponding in size to the number of states in the game.

Next, the batch of $Q(s,a)$ and $Q(s',a')$ values are extracted from the model from states and next_states respectively. The x and y training arrays are then created, initially filled with zeros. After this, a loop is entered to accumulate the x and y values on which to train the model. Within this loop, we extract the memory values from the batch, then set a variable holding the Q values for the current state. If next_state is None, there are no discounted future rewards to add, so the current_q entry corresponding to action is set to a target of the reward only. Alternatively, if there is a valid next_state, then the current_q entry corresponding to action is set to a target of the reward plus the discounted future reward, i.e. $r + \gamma \max_{a'} Q(s', a')$.

The state and current_q are then loaded into the x and y arrays for the given batch, until the batch data is completely extracted. Then the network is trained by calling train_batch() on the model.
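The target construction can be sketched in isolation with NumPy, using a fabricated two-sample batch and stand-in network predictions (all numbers below are assumed, purely for illustration of the loop's logic):

```python
import numpy as np

GAMMA = 0.95  # assumed discount factor
# two fake (state, action, reward, next_state) samples; the second ends the game
batch = [
    (np.array([0.1, 0.0]), 2, 1.0, np.array([0.2, 0.01])),
    (np.array([0.4, 0.02]), 1, 100.0, None),
]
# stand-ins for model.predict_batch(states) and model.predict_batch(next_states)
q_s_a = np.array([[0.5, 0.2, 0.1], [0.3, 0.4, 0.0]])
q_s_a_d = np.array([[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])

x = np.zeros((len(batch), 2))
y = np.zeros((len(batch), 3))
for i, (state, action, reward, next_state) in enumerate(batch):
    current_q = q_s_a[i].copy()
    if next_state is None:
        current_q[action] = reward                           # terminal: reward only
    else:
        current_q[action] = reward + GAMMA * np.amax(q_s_a_d[i])
    x[i] = state
    y[i] = current_q
# y[0] becomes [0.5, 0.2, 2.9] (since 1.0 + 0.95 * 2.0 = 2.9)
# y[1] becomes [0.3, 100.0, 0.0] (terminal sample, reward only)
```

Only the entry for the taken action is changed in each row; the other outputs keep their predicted values, so the mean squared error loss only pushes the network on the action that was actually experienced.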

That completes the review of the main classes within the TensorFlow reinforcement learning example. All that is left is to setup the classes and enter the training loop.

## The main function

The code below sets up the environment and the classes, and runs multiple games to perform the learning:

```python
# imports and hyperparameters (the epsilon, decay and gamma values
# below are example settings)
import math
import random

import gym
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

MAX_EPSILON = 1
MIN_EPSILON = 0.01
LAMBDA = 0.001   # speed of the epsilon decay
GAMMA = 0.95     # discount factor
BATCH_SIZE = 50

if __name__ == "__main__":
    env_name = 'MountainCar-v0'
    env = gym.make(env_name)

    num_states = env.env.observation_space.shape[0]
    num_actions = env.env.action_space.n

    model = Model(num_states, num_actions, BATCH_SIZE)
    mem = Memory(50000)

    with tf.Session() as sess:
        sess.run(model.var_init)
        gr = GameRunner(sess, model, env, mem, MAX_EPSILON, MIN_EPSILON,
                        LAMBDA)
        num_episodes = 300
        cnt = 0
        while cnt < num_episodes:
            if cnt % 10 == 0:
                print('Episode {} of {}'.format(cnt + 1, num_episodes))
            gr.run()
            cnt += 1
        plt.plot(gr.reward_store)
        plt.show()
        plt.close("all")
        plt.plot(gr.max_x_store)
        plt.show()
```

In the first couple of lines, we create an Open AI Gym Mountain Car environment. Next, the number of states and actions are extracted from the environment object itself.

The network model and memory objects are then created – in this case, we’re using a batch size of 50 and a total number of samples in the memory of 50,000.

The TensorFlow session object is created, along with the variable initialization – then the GameRunner class is created. The number of episodes of the Mountain Car game which will be run in this training example is 300. For each of these episodes, we run the game by using the GameRunner run() method.

After all the episodes are run, some plotting is performed on the total reward for each episode, and the maximum x-axis value the cart reaches in the game (remembering that the goal is at x = 0.5). These plots can be observed below:

As can be observed, the network starts out controlling the agent rather poorly, while it is exploring the environment and accumulating memory. However once it starts to receive positive rewards by ascending the right-hand hill, the rewards rapidly increase.

As can be observed above, while there is some volatility, the network learns that the best rewards are achieved by reaching the top of the right-hand hill and, towards the end of the training, consistently controls the car/agent to reach there.

This reinforcement learning tutorial in TensorFlow has shown you:

1. The basics of Q learning
2. The epsilon-greedy action selection policy
3. The importance of batching in training deep Q reinforcement learning networks, and
4. How to implement a deep Q reinforcement learning network in TensorFlow

I hope it has been instructive – keep an eye out for future tutorials in reinforcement learning where more complicated games and techniques will be reviewed.
