# Prioritised Experience Replay in Deep Q Learning

In previous posts (here, here and here and others), I have introduced various Deep Q learning methodologies. If you have been across these posts, you will have observed that a memory buffer is used to recycle older experiences of the agent in the environment and improve learning. This is called experience replay. In this post, I’m going to introduce an important concept called Prioritised Experience Replay (PER). This is a version of experience replay which more frequently calls on those experiences of the agent where there is more learning value. In this introduction, I’ll be using a Dueling Q network architecture, and referring to a previous post on the SumTree algorithm. All code presented in this post can be found on this site’s Github repository.

## Prioritised experience replay

Standard versions of experience replay in deep Q learning consist of storing experience-tuples of the agent as it interacts with its environment. These tuples generally include the state, the action the agent performed, the reward the agent received and the subsequent state. These tuples are generally stored in some kind of experience buffer of a certain finite capacity. During the training of the deep Q network, batches of prior experience are extracted from this memory. Importantly, the samples in these training batches are extracted randomly and uniformly across the experience history.

Prioritised experience replay is an optimisation of this method. The intuition behind prioritised experience replay is that every experience is not equal when it comes to productive and efficient learning of the deep Q network. Consider a past experience in a game where the network already accurately predicts the Q value for that action. This experience sample, passed through the training process, will yield little in the way of improvements of the predictive capacity of the network. However, another experience sample may result in a poor estimation of the actual Q value at this state – this signifies that there is something valuable to learn about the experience and the algorithm should be encouraged to sample it.

### Prioritised experience replay in a Double Q setting

What should the measure be to “rank” the sampling used in Prioritised Experience Replay? The most obvious answer is the difference between the predicted Q value, and what the Q value should be in that state and for that action. This difference is called the TD error, and looks like this (for a Double Q type network, see this post for more details):

$$\delta_i = r_{t} + \gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; \theta_t); \theta^-_t) - Q(s_{t}, a_{t}; \theta_t)$$

Here the left hand part of the equation is what the Q value should be (the target value): $r_{t} + \gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; \theta_t); \theta^-_t)$. The right hand part of the equation is what the Double Q network is actually predicting at the present time: $Q(s_{t}, a_{t}; \theta_t)$.  The difference between these two quantities ($\delta_i$) is the “measure” of how much the network can learn from the given experience sample i. The higher the value, the more often this sample should be chosen.

Note, the notation above for the Double Q TD error features the $\theta_t$ and $\theta^-_t$ values – these are the weights corresponding to the primary and target networks, respectively. The primary network should be used to produce the right hand side of the equation above (i.e. $Q(s_{t}, a_{t}; \theta_t)$).

Often, to reduce the variance of $\delta_i$, the Huber loss function is used on this TD error. The Huber loss function will be used in the implementation below.
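As a sketch, a standalone numpy version of the Huber loss looks like the following (the threshold of 1.0 is an assumption, matching the common default; this mirrors, but is separate from, the huber_loss function used in the implementation later):

```python
import numpy as np

def huber(delta, kappa=1.0):
    """Quadratic for small errors, linear for large ones, so outlier
    TD errors do not dominate the priorities or the gradient."""
    abs_d = np.abs(delta)
    return np.where(abs_d < kappa,
                    0.5 * abs_d ** 2,
                    kappa * (abs_d - 0.5 * kappa))

# small errors behave like squared loss, large ones grow only linearly
print(huber(0.5))   # 0.125
print(huber(3.0))   # 2.5
```

Because the loss grows only linearly outside the threshold, a single wildly mis-predicted sample cannot blow out its priority relative to the rest of the buffer.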

### Drawing prioritised samples

A common way of setting the priorities of the experience samples is by adding a small constant to the TD error term like so:

$$p_i = | \delta_i | + \epsilon$$

This ensures that even samples with a low $\delta_i$ still have a small chance of being selected for sampling. Using these priorities, the discrete probability of drawing sample/experience i under Prioritised Experience Replay is:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

Notice the $\alpha$ factor – this is a way of scaling the prioritisation based on the TD error up or down. If $\alpha = 0$ then all of the $p_i^{\alpha}$ terms go to 1 and every experience has the same chance of being selected, regardless of the TD error. Alternatively, if $\alpha = 1$ then “full prioritisation” occurs i.e. every sample is randomly selected proportional to its TD error (plus the constant). A commonly used $\alpha$ value is 0.6 – so that prioritisation occurs but it is not absolute prioritisation. This promotes some exploration in addition to the PER process.
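The effect of $\alpha$ can be sketched in a few lines of numpy (the TD error values here are hypothetical, purely for illustration):

```python
import numpy as np

def sampling_probs(td_errors, alpha, eps=0.01):
    # p_i = (|delta_i| + eps) ** alpha, normalised to a probability distribution
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

errors = np.array([0.1, 1.0, 5.0])
print(sampling_probs(errors, alpha=0.0))   # uniform: every sample equally likely
print(sampling_probs(errors, alpha=1.0))   # fully proportional to |delta| + eps
print(sampling_probs(errors, alpha=0.6))   # in between: prioritised, but tempered
```

Running this shows the $\alpha = 0.6$ distribution sitting between the uniform and fully-prioritised extremes, which is exactly the exploration/prioritisation trade-off described above.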

### Importance sampling

Another aspect of Prioritised Experience Replay is a concept called Importance Sampling (IS). This part of Prioritised Experience Replay requires a bit of unpacking, for it is not intuitively obvious why it is required. When we are performing some Q based learning, we are trying to minimise the TD error by changing the model parameters $\theta$. However, we don’t have an exact function for the TD error based on all the possible states, actions and rewards in an environment. Instead, what we have are samples from the environment in the form of these experience tuples (states, actions, rewards).

Therefore, what we are really trying to minimise is the expected value of the TD error, based on these samples. However, by drawing experience tuples based on the prioritisation discussed above, we are skewing or biasing this expected value calculation. As a result, we would no longer be solving the problem we are actually supposed to be solving during the training of the network.

A solution to this problem is to use something called importance sampling. There is more to IS, but, in this case, it is about applying weights to the TD error to try to correct the aforementioned bias. What does this look like? Here is an expression of the weights which will be applied to the loss values during training:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$

This weight value will be multiplied by the TD error ($\delta_i$), which has the same effect as reducing the gradient step during training. These weights can “slow down” the learning of certain experience samples with respect to others. The variable N refers to the number of experience tuples already stored in your memory (and will top-out at the size of your memory buffer once it’s full). By looking at the equation, you can observe that the higher the probability of the sampling, the lower this weight value will be.

Because experience samples with a high priority / probability will be sampled more frequently under PER, this weight value ensures that the learning is slowed for these samples. This ensures that the training is not “overwhelmed” by the frequent sampling of these higher priority / probability samples and therefore acts to correct the aforementioned bias.

The $\beta$ value is generally initialised between 0.4 and 0.6 at the start of training and is annealed towards 1 at the end of the training. Because the value within the bracket is always < 1, a $\beta$ of < 1 will actually increase the weight values towards 1 and reduce the effect of these weight values. The authors of the original paper argue that at the beginning of the training, the learning is chaotic and the bias caused by the prioritisation doesn’t matter much anyway. It is only towards the end of the training that this bias needs to be corrected, so the $\beta$ value being closer to 1 decreases the weights for high priority / probability samples and therefore corrects the bias more. Note that in practice these weights $w_i$ in each training batch are rescaled so that they range between 0 and 1.
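Putting the weight formula and the rescaling together, a minimal numpy sketch (using a hypothetical batch of sampling probabilities) looks like this:

```python
import numpy as np

def is_weights(probs, n, beta):
    # w_i = (N * P(i)) ** -beta, then rescaled so that max(w) == 1
    w = (n * probs) ** -beta
    return w / w.max()

probs = np.array([0.5, 0.3, 0.2])   # hypothetical P(i) values for a batch of 3
print(is_weights(probs, n=3, beta=0.4))  # early training: mild correction
print(is_weights(probs, n=3, beta=1.0))  # end of training: full correction
```

Note that the highest-probability sample always receives the smallest weight, and that moving $\beta$ from 0.4 towards 1 widens the gap between the weights, i.e. the bias correction becomes stronger.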

### Sampling and the SumTree data structure

In order to sample experiences according to the prioritisation values, we need some way of organising our memory buffer so that this sampling is efficient. One feasible way of sampling is to create a cumulative sum of all the prioritisation values, and then sample from a uniform distribution on the interval (0, max(cumulative_prioritisation)). This will result in sampling which is appropriately weighted according to the prioritisation values i.e. according to $P(i)$. However, this method of sampling requires a search through the cumulative sums until the cumulative value first exceeds the random value; the corresponding entry is then the selected sample.

This is fine for small-medium sized datasets, however for very large datasets such as the memory buffer in deep Q learning (which can be millions of entries long), this is an inefficient way of selecting prioritised samples. The alternative to this method of sampling is the SumTree data structure and algorithms. The SumTree structure won’t be reviewed in this post, but the reader can look at my comprehensive post here on how it works and how to build such a structure.
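For comparison, the naive cumulative-sum approach can be sketched as follows (the priority values are hypothetical). Even when the lookup uses a binary search via np.searchsorted, the cumulative array has to be rebuilt whenever any priority changes, which is precisely the cost the SumTree avoids:

```python
import numpy as np

rng = np.random.default_rng(0)
priorities = np.array([1.0, 4.0, 2.0, 3.0])

# cumulative sums: [1, 5, 7, 10] - must be recomputed after every priority update
cum = np.cumsum(priorities)

def naive_sample(cum, rng):
    # uniform draw on (0, total priority), then locate the owning interval
    u = rng.uniform(0, cum[-1])
    return int(np.searchsorted(cum, u))

# empirical sampling frequencies approach priorities / sum(priorities)
counts = np.bincount([naive_sample(cum, rng) for _ in range(100000)], minlength=4)
print(counts / counts.sum())  # ~ [0.1, 0.4, 0.2, 0.3]
```

This reproduces $P(i)$ correctly, but in deep Q learning the priorities of sampled experiences are updated after every training batch, so recomputing the cumulative sum over a multi-million entry buffer each step is prohibitive.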

That concludes the theory component of Prioritised Experience Replay and now we can move onto what the code looks like.

## Implementing Prioritised Experience Replay in TensorFlow 2

The code below will demonstrate how to implement Prioritised Experience Replay in TensorFlow 2. The code for this example can be found on this site’s Github repo. Please note that this code will heavily utilise code from one of my previous tutorials on Dueling Q learning in Atari environments and common code components will not be explained in detail in this post. The reader can go back to that post if they wish to review the intricacies of Dueling Q learning and using it in the Atari environment. This example will be demonstrated in the Space Invaders Atari OpenAI environment. The first part of the code can be observed below:

# imports and environment creation (the gym environment id is assumed)
import random
import datetime as dt
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
import imageio

env = gym.make("SpaceInvaders-v0")

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.1
EPSILON_MIN_ITER = 500000
GAMMA = 0.99
BATCH_SIZE = 32
TAU = 0.08
POST_PROCESS_IMAGE_SIZE = (105, 80, 1)
DELAY_TRAINING = 50000
BETA_DECAY_ITERS = 500000
MIN_BETA = 0.4
MAX_BETA = 1.0
NUM_FRAMES = 4
GIF_RECORDING_FREQ = 100
MODEL_SAVE_FREQ = 100
num_actions = env.action_space.n

def huber_loss(loss):
    return 0.5 * loss ** 2 if abs(loss) < 1.0 else abs(loss) - 0.5

class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                            kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                              kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if not self.dueling:
            return adv
        v = self.v_dense(x)
        v = self.v_out(v)
        norm_adv = self.lambda_layer(adv)
        combined = self.combine([v, norm_adv])
        return combined

primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

First, the reader can see that various constants are declared. Following this, a custom Huber loss function is declared, which will be used later in the code. Next, a custom Keras model is created which instantiates a Dueling Q architecture – again, refer to my previous post for more details on this. Finally, a primary and target network are created to perform Double Q learning, and the target and primary network weights are set to be equal.

After this declaration, the SumTree data structures and functions are developed. For more details, as stated previously, see my SumTree post.

Next, the Memory class is created:

class Memory(object):
    def __init__(self, size: int):
        self.size = size
        self.curr_write_idx = 0
        self.available_samples = 0
        self.buffer = [(np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1]),
                                 dtype=np.float32), 0.0, 0.0, 0.0) for i in range(self.size)]
        self.base_node, self.leaf_nodes = create_tree([0 for i in range(self.size)])
        self.frame_idx = 0
        self.action_idx = 1
        self.reward_idx = 2
        self.terminal_idx = 3
        self.beta = 0.4
        self.alpha = 0.6
        self.min_priority = 0.01

The only input argument to the class is the size of the memory buffer (i.e. how many experience tuples it will hold). The curr_write_idx variable designates the current position in the buffer at which to place new experience tuples. If this value exceeds the size of the buffer, it is reset back to the beginning of the buffer (i.e. 0). The available_samples variable is a measure of how many samples have been placed in the buffer. Once the buffer has been filled for the first time, this variable will be equal to the size variable.

Next, the experience buffer is initialized with zeros. It is important that you initialize this buffer at the beginning of the training, as you will be able to instantly determine whether your machine has enough memory to handle the size of this buffer. If you don’t initialize but dynamically append to a list, you run the risk of exceeding the memory of your machine half way through your training – which can be frustrating during long training runs!

The next line involves the creation of the SumTree object. The SumTree is initialised with the number of leaf nodes equal to the size of the buffer, and with a value of 0. Again, for more details on the SumTree object, see this post. The following variable declarations (frame_idx, action_idx, reward_idx and terminal_idx) specify what tuple indices relate to each of the variable types stored in the buffer. Finally, the IS $\beta$ and the PER $\alpha$ values are initialised with values previously discussed above, and the minimum priority value to add to each experience tuple is defined as some small float value.

The next method in the Memory class appends a new experience tuple to the buffer and also updates the priority value in the SumTree:

def append(self, experience: tuple, priority: float):
    self.buffer[self.curr_write_idx] = experience
    self.update(self.curr_write_idx, priority)
    self.curr_write_idx += 1
    # reset the current writer position index if greater than the allowed size
    if self.curr_write_idx >= self.size:
        self.curr_write_idx = 0
    # max out available samples at the memory buffer size
    if self.available_samples < self.size:
        self.available_samples += 1

def update(self, idx: int, priority: float):
    update(self.leaf_nodes[idx], self.adjust_priority(priority))

def adjust_priority(self, priority: float):
    return np.power(priority + self.min_priority, self.alpha)

Here you can observe that both the experience tuple (state, action, reward, terminal) and the priority of this experience are passed to this method. The experience tuple is written to the buffer at curr_write_idx and the priority is sent to the update method of the class. The update method of the Memory class in turn calls the SumTree update function which is outside this class. Notice that the "raw" priority is not passed to the SumTree update, but rather the "raw" priority is first passed to the adjust_priority method. This method adds the minimum priority factor and then raises the priority to the power of $\alpha$ i.e. it performs the following calculation:

$$p_i^\alpha = \left( | \delta_i | + \epsilon \right)^\alpha$$

Note that the normalising denominator in $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ is not computed here; the division by $\sum_k p_k^\alpha$ happens implicitly later, when samples are drawn in proportion to the leaf values of the SumTree.

After the experience tuple is added, the current write index is incremented. If the current write index now exceeds the size of the buffer, it is reset back to 0 to start overwriting old experience tuples. Next, the available_samples value is incremented, but only if it is less than the size of the memory, otherwise it is clipped at the size of the memory. Next is the (rather complicated) sample method:

def sample(self, num_samples: int):
    sampled_idxs = []
    is_weights = []
    sample_no = 0
    while sample_no < num_samples:
        sample_val = np.random.uniform(0, self.base_node.value)
        samp_node = retrieve(sample_val, self.base_node)
        if NUM_FRAMES - 1 < samp_node.idx < self.available_samples - 1:
            sampled_idxs.append(samp_node.idx)
            p = samp_node.value / self.base_node.value
            is_weights.append((self.available_samples + 1) * p)
            sample_no += 1
    # apply the beta factor and normalise so that the maximum is_weight is 1
    is_weights = np.array(is_weights)
    is_weights = np.power(is_weights, -self.beta)
    is_weights = is_weights / np.max(is_weights)
    # now load up the state and next state variables according to sampled idxs
    states = np.zeros((num_samples, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                      dtype=np.float32)
    next_states = np.zeros((num_samples, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                           dtype=np.float32)
    actions, rewards, terminal = [], [], []
    for i, idx in enumerate(sampled_idxs):
        for j in range(NUM_FRAMES):
            states[i, :, :, j] = self.buffer[idx + j - NUM_FRAMES + 1][self.frame_idx][:, :, 0]
            next_states[i, :, :, j] = self.buffer[idx + j - NUM_FRAMES + 2][self.frame_idx][:, :, 0]
        actions.append(self.buffer[idx][self.action_idx])
        rewards.append(self.buffer[idx][self.reward_idx])
        terminal.append(self.buffer[idx][self.terminal_idx])
    return states, np.array(actions), np.array(rewards), next_states, np.array(terminal), sampled_idxs, is_weights

The purpose of this method is to perform priority sampling of the experience buffer, but also to calculate the importance sampling weights for use in the training steps. The first step is a while loop which iterates until num_samples have been sampled. Each sample is drawn by selecting a uniform random number between 0 and the base node value of the SumTree, and then retrieving the corresponding leaf node from the SumTree data structure according to the stored priorities.

A check is then made to ensure that the sampled index is valid and if so it is appended to a list of sampled indices. After this appending, the $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ value is calculated. The SumTree base node value is actually the sum of all priorities of samples stored to date. Also recall that the $\alpha$ value has already been applied to all samples as the “raw” priorities are added to the SumTree.

Now, the IS weights are calculated according to the following:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$

This can alternatively be expressed as:

$$w_i = \left( N \cdot P(i) \right)^{-\beta}$$

On the next line of the code, the following values are appended to the is_weights list: $\left( N \cdot P(i) \right)$. Following the accumulation of the samples, the IS weights are then converted from a list to a numpy array, then each value is raised element-wise to the power of $-\beta$. Following this, the IS weights are then normalised so that they span between 0 and 1, which acts to stabilise learning.

Next, the states and next_states arrays are initialised – in this case, these arrays will consist of 4 stacked frames of images for each training sample. For more explanation on training in an Atari environment with stacked frames – see this post. Finally, these frame / state arrays, associated rewards and terminal states, and the IS weights are returned from the method.

That concludes the explanation of the rather complicated Memory class.

Next we initialise the Memory class and declare a number of other ancillary functions (which have already been discussed here).

memory = Memory(200000)

def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image

def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                                POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)).numpy()))

def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack

def record_gif(frame_list, episode, reward, fps=50):
    imageio.mimsave(STORE_PATH + "/SPACE_INVADERS_EPISODE-eps{}-r{}.gif".format(episode, reward),
                    frame_list, fps=fps)


The next function calculates the target Q values for training (see this post for details on Double Q learning) and also calculates the $\delta_i$ priority values for each sample:

def get_per_error(states, actions, rewards, next_states, terminal, primary_network, target_network):
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    # the action selection from the primary / online network
    prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
    # the q value for the prim_action_tp1 from the target network
    q_from_target = target_network(next_states)
    batch_idxs = np.arange(states.shape[0])
    updates = rewards + (1 - terminal) * GAMMA * q_from_target.numpy()[batch_idxs, prim_action_tp1]
    target_q[batch_idxs, actions] = updates
    # calculate the loss / error to update priorities
    error = [huber_loss(target_q[i, actions[i]] - prim_qt.numpy()[i, actions[i]]) for i in range(states.shape[0])]
    return target_q, error

The first part of the function and how it works to estimate the target Q values has been discussed in previous posts (see here). The new feature in this Prioritised Experience Replay example is the calculation of error. The value that is calculated on this line is the TD error, passed through a Huber loss function:

$$\delta_i = r_{t} + \gamma Q(s_{t+1}, argmax Q(s_{t+1}, a; \theta_t); \theta^-_t) - Q(s_{t}, a_{t}; \theta_t)$$

Which is the same as:

$$\delta_i = \text{target\_q} - Q(s_{t}, a_{t}; \theta_t)$$

Note that $Q(s_{t}, a_{t}; \theta_t)$ is extracted from the primary network (with weights of $\theta_t$). Both the target Q values and the Huber error / $\delta_i$ are returned from this function.

The next function uses the get_per_error function just reviewed, updates the priority values for these samples in the memory, and also trains the primary network:

def train(primary_network, memory, target_network):
    states, actions, rewards, next_states, terminal, idxs, is_weights = memory.sample(BATCH_SIZE)
    target_q, error = get_per_error(states, actions, rewards, next_states, terminal,
                                    primary_network, target_network)
    for i in range(len(idxs)):
        memory.update(idxs[i], error[i])
    loss = primary_network.train_on_batch(states, target_q, is_weights)
    return loss

As can be observed, first a batch of samples is extracted from the memory. Next the target_q and Huber loss TD errors are calculated. For each memory index, the error is passed to the Memory update method. Note that, every time a sample is drawn from memory and used to train the network, the new TD errors calculated in that process are passed back to the memory so that the priority of these samples is then updated. This ensures that samples whose TD errors were once high (and were therefore valuable due to the fact that the network was not predicting them well) but are now low (due to network training) will no longer be sampled as frequently.

Finally, the primary network is trained on the batch of states and target Q values. Note that a third argument is passed to the Keras train_on_batch function – the importance sampling weights. The Keras train_on_batch function has an optional argument which applies a multiplicative weighting factor to each loss value – this is exactly what we need to apply the IS adjustment to the loss values.
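To make the effect of sample_weight concrete, here is a numpy sketch of the weighted-loss computation that the training step effectively performs (the per-sample loss and weight values here are hypothetical, not taken from the training run):

```python
import numpy as np

# hypothetical per-sample Huber losses for a batch of 4
per_sample_loss = np.array([0.8, 0.1, 0.5, 0.2])
# hypothetical IS weights from Memory.sample(), already normalised to a max of 1
batch_is_weights = np.array([0.4, 1.0, 0.6, 0.9])

# each sample's loss is scaled by its IS weight before averaging,
# so frequently-drawn high-priority samples contribute smaller gradient steps
weighted_loss = np.mean(batch_is_weights * per_sample_loss)
print(weighted_loss)  # 0.225
```

In other words, a high-priority sample that arrives with a small IS weight moves the network less per update, which is exactly the bias correction discussed in the theory section.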

The code following is the main training / episode loop:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + "/DuelingQPERSI_{}".format(dt.datetime.now().strftime('%d%m%Y%H%M')))
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape((POST_PROCESS_IMAGE_SIZE[0],
                                                                            POST_PROCESS_IMAGE_SIZE[1],
                                                                            NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        old_state_stack = state_stack
        state_stack = process_state_stack(state_stack, next_state)
        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network)
            update_network(primary_network, target_network)
            _, error = get_per_error(tf.reshape(old_state_stack, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                                  POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)),
                                     np.array([action]), np.array([reward]),
                                     tf.reshape(state_stack, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                              POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)), np.array([done]))
            # store in memory
            memory.append((next_state, action, reward, done), error[0])
        else:
            loss = -1
            # store in memory - default the priority to the reward
            memory.append((next_state, action, reward, done), reward)
        avg_loss += loss
        # linearly decay the eps and anneal the PER beta values
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                  MIN_EPSILON
            beta = MIN_BETA + ((steps - DELAY_TRAINING) / BETA_DECAY_ITERS) * \
                   (MAX_BETA - MIN_BETA) if steps < BETA_DECAY_ITERS else \
                   MAX_BETA
            memory.beta = beta
        steps += 1
        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print("Episode: {}, Reward: {}, avg loss: {:.5f}, eps: {:.3f}".format(i, tot_reward, avg_loss, eps))
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print("Pre-training...Episode: {}".format(i))
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i, tot_reward)
            break
        cnt += 1

This training loop has been explained in detail here, so please refer to that post for a detailed explanation. The first main difference to note is the linear increment from MIN_BETA to MAX_BETA (0.4 to 1.0) over BETA_DECAY_ITERS training steps – the purpose of this change in the $\beta$ value has been explained previously. The next major difference results from the need to feed a priority value into memory along with the experience tuple during each episode step. This is calculated by calling the get_per_error function that was explained previously, and this error is passed to the memory append method. Before training of the network actually starts (i.e. steps < DELAY_TRAINING), the reward is substituted for the priority in the memory.

This concludes the explanation of the code for this Prioritised Experience Replay example. Now let’s look at the results.

## Prioritised Experience Replay Atari Space Invader training results

The graph below shows the progress of the rewards over ~1000 episodes of training in the Open AI Space Invader environment, using Prioritised Experience Replay:

*Prioritised Experience Replay training results*

This concludes my post introducing the important Prioritised Experience Replay concept. In future posts, I’ll deal with other types of reinforcement learning algorithms.