Everything you need to know about Reinforcement Learning in 80 minutes

80 min readJan 25, 2022

A quick note!

Hey! This article will contain a lot of my notes that I’ve been taking from David Silver’s Reinforcement Learning Course with some of my notes. This article can be used as a self-study guide or as a read-a-long with his course! This article assumes that you have a good understanding of high school math + basic differentiation. You can find the original lectures here. To learn more about me, here’s my twitter + linkedin + website.
It would mean a lot if you could help share this resource wherever you can! I spent a lot of time curating notes + creating a good understanding for people to learn RL from, and my goal is to help as many people as I can!

Reinforcement Learning sits at the intersection of many fields with the same fundamental idea: the science of decision making. In Computer Science, it’s machine learning but in Neuroscience, it’s reward systems. Engineering, optimal control.

Why is RL different from other machine learning paradigms + fields?

There aren’t supervisors, only rewards
Feedback in RL is generally delayed, not spontaneous
Time plays a big role for performing in dynamic systems such as RL
The agent’s actions affect the subsequent data it receives (closed feedback loop)

General applications right now?

Fly stunt maneuvers in a helicopter
Defeating world champion in games i.e. GO, Atari
Investment portfolios
Walking robots

Defining the Reinforcement Learning problem

A reward Rₜ is this general scalar feedback signal to help the agent understand how well it’s doing at step t. Generally, the job of the agent is to maximize the cumulative reward. RL is based on this “reward hypothesis”; all goals can be described through the maximization of expected cumulative reward.

What’s the goal here? select certain actions that maximize the total future reward. Sometimes, it might be better for you to endure a short-term loss for long-term gains in an environment (ex. chess.)

Understanding the agent + environment

The key principles of RL are based on this interaction of agent and environment.

At each step t, the agent:

executes action Aₜ
receives observation Oₜ
receives scalar reward Rₜ

the environment:

receives action Aₜ
emits observation Oₜ
emits scalar reward Rₜ

History + state

The history is generally the sequence of observations, actions, and rewards.

We want to measure all observable variables up until time t. Generally, what happens next depends on that history.

The agent selections actions
The environment selects observations/rewards

The state is the information that we can use to determine what happens next. It’s formally defined as a function of history:

For example, the environment state, Sᵉₜ is the environment’s private representation. This is exactly whatever data the environment generally uses to pick the next observation/reward to give to the agent [note: the environment state is not visible to the agent. Even if it is, it may contain irrelevant info.]

The agent state Sᵃₜ is the agent’s internal representation of those observations. The state generally contains whatever information the agent needs to use to pick the next action. It can be any function of history.

Markov state

Also known as the information state. Think of it like an array that contains all the useful information through history.

A state Sₜ is Markov if and only if…

The probability of the next state [given the state that you are currently in] will be the same even if you showed all the previous states of the system.

What does this mean? You can throw all the previous states away and just use the current one and you’ll still get the same probability. The future is independent of the past given the present.

Therefore, the environment state Sᵉₜ is Markov because it fully characterizes what’ll happen next in the environment because that’s exactly what the environment is using to predict its next environment. The same logic also applies with the history Hₜ.

Fully observable environments

This is exactly when the agent can directly see + observe the environment state (this is generally ideal). The agent state = environment state = information state. Oₜ = Sᵃₜ = Sᵉₜ. This is formally defined as the Markov Decision Process (MDP).

Partially observable environments

The agent indirectly observes the environment. Think of it like a robot with camera vision; it isn’t told where it actually is [location].

Agent state ≠environment state. This is formally defined as the partially observable Markov Decision process (POMDP).

In this environment, the agent has to construct its own state representation Sᵃₜ whether this might be…

A proper RL agent generally includes ≥1 of these components:

Policy: agent’s behaviour function
Value function: how good each state/action is
Model: how the agent represents the environment

Policy

The policy is exactly what the agent’s behaviour would be that focuses on mapping from state → action. In a deterministic policy, we generally write that down as

Deterministc policy

Stochastic policy

Value function

The value function is generally the prediction of future rewards. We can use this to evaluate the “goodness/badness” of states which we then can use to select between actions.

Think of it as money. Let’s say that the more money you have, the happier you become. The value function figures out what stage you’re at in becoming happy. If you don’t have enough money, then you’ll be unhappy.

That’s exactly the same with reward. The more reward we have, the happier we become. Our goal is to get as much reward as we can, which we can model through this equation:

Where γ is something known as the discount factor. Our reward has a timescale to how far ahead we look into the future. The way γ works is that it’s a value between 0 and 1 i.e. γ ∈ [0, 1]. Once that value is selected, we exponentiate γ more and more until it essentially reaches 0. That’s exactly the point in time [at our current state] until we don’t care about it.

Model

The model is what predicts what the environment will do next. Why this is important? Modelling the environment and what the future might look like might allow us to make a plan to help you figure out what you want to do next.

This is broken down into 2 main parts:

Transitions: this is exactly what predicts the next state in the environment ex. if i predicted that i should steer 5 degrees to the right, what will the camera input look like for the next frame?
Rewards: now given that transition, what would our next [immediate] reward look like? What would be the reward be if I steered 5 degrees to the right?

We predict the next state and reward in the environment by using information from the current state + action. note: P is our transition model and R is the reward prediction.

Categorizing RL agents

We can mainly break it down in this hierarchy system:

The main difference between policy-based and value-based is that policy-based methods are all about creating that mapping i.e. a= π(s) and use that during learning. You’re building the ideal policy which changes over time given more info to optimize. In the value-based approach though, you just store a value function. The policy itself is derived from the value function i.e. pick the action with the best value.

Let’s say that this was our maze:

Value-based approaches would simply tell you the value for a given action. The RL agent is greedy. It wants to increase its value as fast as it can so it’ll just follow the numbers.

But with policy, instead of representing numbers based on each state, we can explicitly represent the direction/map that the agent should follow and just simply work in that space. Notice that the value function never explicitly stores the value function.

The actor-critic algorithm simply stores the information of these 2 (policy + value functions) methods together.

We also have something unique known as model-free agents. Model-free means that we not try to explicitly understand the environment around us. Instead, we directly try to implement a policy function. We just get experience and figure out the policy of how to behave to figure out the best way to get a reward. We don’t overcomplicate the problem as it is.

Alternatively, we have model-based agents. For this, we need to build a model of how the environment works. And then from there do we decide where we want to go ex. model-based would model a helicopter and its environment in the air and then decide what direction should i be moving in.

We can group this all in a venn diagram:

Problems within Reinforcement Learning

There are generally 2 fundamental problems in sequential decision making that we try to solve:

the environment is initially unknown
the agent interacts with the environment
the agent improves this policy

We define this to be the Reinforcement Learning problem. this is all about trial-and-error making. I’m sure you’ve heard of the walking analogy; a baby tries to walk on its two feet. But it falls at first. Over time, it then starts to better understand how to walk and eventually, just walks!

The other problem we like to define is known as the planning problem. This is completely different:

the model of the environment is fully known (ex. we know the weather outside, wind blowing against the self-driving car)
agent performs computations with this model
agent then improves this policy

Notice the difference here. The latter is given “privileged” information about the environment while the former does not. If I was playing Forza, I would be given information such as how to drive if I was using a planning approach. I would not if I were a normal RL agent without any info.

RL is generally a trial-and-error problem. The goal is to discover a good policy from its experiences of the environment without losing too much reward. But we need to find this balance between exploration to see if there’s more reward or exploration and using already-known information.

What does this look like? If I wanted to eat dinner, would I try to explore and find new restaurants that I might love or exploit and stick with my current favourite restaurant?

Prediction and control

Prediction is all about evaluating the future when you’re given a policy while control is all about optimizing the future while trying to find the policy that does so.

For example, prediction is asking “if i continued driving straight, how much reward would i get?” vs if i was talking in control terms, “which direction should i walk to then figure out the best reward i can get?”

Markov Decision Process

This is what we use to formally describe an environment in RL where we can see everything in the environment [is fully observable]. All the information that there is presented to our agent. What this means is that the current state of the agent completely characterizes the process in the way in which the environment unfolds depending on that state which we already know [because the environment is fully observed].

Generally, almost ALL RL problems can be formalized as MDPs ex.

optimal control ex. steer a self-driving car usually deals with something known as continuous MDPs (regression)
partially observable problems can be converted into MDPs
bandits (expansion of exploration vs exploitation) are generally MDPs with one state

recap: the markov property is when “the future is independent of the past given the present”

where S is our current state. The state completely characterizes all the information that we know. Because all previous state information is irrelevant, we can discard it.

For any Markov property, if you start in some state s and you have some successive state s’, then it’s possible to find a probability of transitioning from the current state to the successive state.

Remember that Sₜ has all the information in the current state. There could be a well-defined transition probability that I’m currently in a state that I already was in before (the past), there’s some probability that I will transition to some next state. ex. if an agent is at a yellow light, then there’s a set probability that i speed through it or slow down. All these outcomes are characterized in the state that it was in before (the state when you see that yellow light). This is defined as

So, it’s the probability of transitioning from state s → s’ that gives is the probability of a random next state (speeding through the light) given the previous state that we were in (seeing that yellow light). We can put this in matrix form:

where each row of the matrix sums to 1 and n is the number of possible outcomes for each state that i was in (ex. one outcome is speeding through, another is slowing down the yellow light).

Formally, the Markov process is a sequence of states i.e. S₁, S₂, … with the Markov property where

This is an example of a “student markov chain”:

What’s even going on here? Let’s say that you’re in “Class 1” right now. There’s a 50% probability that you go on Facebook or you end up staying through the whole class and then go to class 2. If you go onto facebook though, there’s a 90% chance you end up staying on (because you’re now in “that cycle”) or there’s another 10% chance that you shift your focus back to “Class 1”.

Let’s say you make it to “Class 2”. At this point, there’s an 80% class that you make it to “Class 3” or just end up getting way too bored and fall asleep with a probability of 20%.

If you make it past “Class 2,” congrats! You’re now in “Class 3” where you have a 60% of passing or you end up going to the pub for the drink. If you end up going to the pub and have a few shots, there are probabilities that you’d go back to the first, second, or even third class because you forgot all the information.

But if you make it through “Class 3” with the 60% probability, congrats! You’ve now passed the class and will go to sleep.

This is an example of a Markov chain. Let’s look at a sample (a sequence of states) that take you from in → out. For example:

C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep

Think of each jot note as a random sequence that samples these dynamics i.e. the dynamics of that Markov chain (image). The diagram is just one way to look at this chain.

Here’s what that would look like in “state transition matrix form”:

Markov Reward Process

Mark reward processes are generally a normal Markov chain (process) with values. There’s some value judgement that tells you how much reward will you accumulate given a Markov sequence?

Remember that the reward is immediate. “If we start in some state S, how much reward would get from the exact state, exact moment?” What we care about is maximizing the accumulative sum in reward over time. We’re just showing this in one step (Rₛ).

Going back to our chain, let’s [hypothetically] create rewards for every action we take.

We don’t care about the immediate reward we get. We care about the total reward. We can define that by creating the return variable, Gₜ that gives us the total discounted reward from time-step t:

where γ ∈ [0, 1] is the present value of future rewards (recap: γ is a value randomly sampled between 0 and 1 which is the discount factor → what this means is that over time, because we’re exponentiation γ, the value of the rewards at k time steps in the future will ultimately equal 0, thus indicating that they are irrelevant).

Therefore, the value of receiving reward R after k+1 timesteps would be γᵏR.

This concept of judgement via discount factors emphasizes that we care more about short-term rewards more than long-term rewards.

Why do we do this discount factor? Because we do not have a perfect model. ex. if we were building a Markov process to model the environment and we don’t have a perfect model of the environment, we are uncertain about the future. That’s what short-term outcomes are more important to us. It’s also mathematically convenient and avoids infinite rewards (the point where γᵏR=0 is when we don’t care about time step k anymore).

The value function v(s) is quite different in this context — it gives us the long-term value of the state s. If you’re in some state s, how much value will get from there on? What’s the total reward that you’d get from that state on.

If we go back to the chain diagram, how much reward will you get from ex. class 2 to the end?

Formally, the state value function v(s) of a Markov Reward Process is the expected return starting from state s. Formally, we write that down as

The reason why we take the expected return is that we’re in a stochastic environment. We don’t know about every single path that we can take and how that shapes the reward.

Let’s try sampling rewards from our Student Markov Chain (let’s say γ=1/2):

What if γ=1?

This all falls under the Bellman equation. What this states is that the value function can be decomposed into 2 separate parts:

the immediate reward Rₜ₊₁
discounted value of successor state γv(Sₜ₊₁) → the value you get from the immediate reward onwards

Essentially what happens is that v(s) is equal to the expectation of the total reward given the state Sₜ.

Then we simply expand on that equation and can simply factor γ out of the equation (note that Rₜ₊₁ has already been taken out before that factoring).

After factoring that expression under γ, you can notice that the same original equation we defined for v(s) above shows up again → but works when now we’re in time t+1.

This is exactly why the equation makes sense; the value function of state s is equal to its immediate reward + its value function of the next state. That’s what the Bellman equation is.

In this example here, we’re using an undiscounted factor i.e. γ = 1. In the highlighted case of 4.3, what’s going on is that the moment we enter node [which we already have], we get a reward of -2 automatically. Then from there, we multiply the probability of going to the next state and the reward we’d get from that (which is 0.610 and 0.80.4). From there we sum it all up to get 4.3

This is how it’s normally written:

We can also express the Bellman equation in matrix form:

Where v(1)…v(n) is the value function for all specific states where R₁ to Rₙ is our reward function given that state, the P matrix contains the transition probability of going from one state to another, and the v vector at the end is the values of where you end up.

Because the Bellman equation is a linear equation, we can solve for v together:

The computational running time of this solution is defined to be O(n³) given n states, so it isn’t really practical; it’ll only work well with small Markov Reward Processes.

Going back to MDPs…

This is exactly what’s most commonly used in reinforcement learning. We’re also adding one more piece of complexity to MRPs now which is actions. So far, we’ve only looked at calculating the reward and understanding the environment, but we haven’t told the agent what to do with any of this information. We can do this by introducing decisions.

The change here is that the probability transition matrix depends on which action you take. The reward function may also change based on the action.

Let’s redo our Student chain and make it an MDP:

The decisions are now labelled in red. Now you have more control over the path that you want to take.

In order to do this, we need to formalize a way to make decisions. You can’t follow a map if you’re not given the route to follow.

We can do this by creating policies. Policies are generally some sort of distribution over actions given some sort of states:

Given some state s, the distribution we get would give us some sort of mapping to figure out the probability of taking a certain action. The policy function is generally stochastic and the reason why it is so is because it allows us to do things such as exploration.

So to summarize, a policy fully defines the behaviour of an agent. And those policy depends on the current state that we’re in [and not the history because we’re in a Markov state]. The reason why the reward isn’t modelled here is because the state s fully characterizes your future rewards and is time-independent. Our goal is just to maximize the reward in the future, and we don’t need the reward to do so.

The connection between MDP and MRPs is that we can always recover an MRP from an MDP.

If we have some policy that helps us take some actions, the sequence of states that we draw when we follow that particular process is a Markov process.

Once we fix the policy, if we look at the states and reward sequence, that sequence we get is a Markov Reward Process (point #3).

Note that P^π and R^π are the transition dynamics and the reward function which just simply average over our policy. We have the transition dynamics of all the possible things that I could do (ex. if P(walking left) = 0.5 and P(walking right) = 0.5, then I’m just multiply all the probability of the states that I’m going to end up on the left by 0.5 and the same for the right. Those average dynamics define some Markov Reward Process.

We originally defined the value function to be that of the MRP but now, we need to also define it for the MDP. So, our state-value function v_π(s) of the MDP is the expected return starting from state s, and then following policy π:

Now what v_π(s) tells us is, “how good is it to be in state s if I’m following policy π?”

We also need to define the second type of value function, known as the action-value function. We need to also figure out “how good would a particular action be for a given state [that would give us more reward]?”

Let’s go back to our student MDP…

Now we can also define the Bellman expectation equation — the state-value function can be further decomposed into the immediate reward + the discounted value of the next state:

The same also goes with the action function:

The way v_π and q_π relate to each other can be shown using the graph below:

So, for our state-value function here, we’re going to average over the actions that we might take (black boxes) shown by its probability. the q value tells us “how good is it to take that action from that step?”

Now let’s try doing the opposite (inverse) of this — what would happen if we were given a single action value?

What’s going on here is that you’re in a particular state and you’re considering how good is it to go right from that state vs left? We now have to average the dynamics over our MDP; for each situation that might occur for going right, what’s the value of going right which also follows my policy?

We can do this by averaging over everything by taking the probabilities of the transition dynamics (Pᵃₛₛ’) which gives us the action-value function at those roots here.

Remember that v tells us how good it is to be in a particular state while q is telling us how good is it to take a particular action given a particular state.

Let’s put together those 2 diagrams together:

Given the value function for a particular state, we can figure out how good of a state we’re in by taking a 2-step (row) look in it. We’re considering all the actions we might take next (go left/right) and then all the actions the environment might do to us (ex. we might get blown in one direction by the wind). For each of those things the environment might do, we want to know how good is it to be in that state (bottom row of white circles) and carry on with my usual policy (reward).

If we average everything all together i.e. we’re weighing each of the arcs of the second row by the policy we select (left or right), we’re also averaging each arc of the final row, all by the transition probability, Pᵃₛₛ’, we then finally figure out how good is it to be in that particular state (the top node).

Just like with the state-value function, we can also do the same thing with the action-value function:

What is all of this saying? The value function of the current time step is equal to the immediate reward + the value function of where you end up. That’s what everything above is saying.

Going back to our Student MDP:

The state value function (7.4) is broken down into 2 things: studying or going to the pub. That means that each part has a probability of 0.5. From there we need to sum up the probability of going to the next state multiplied by its value.

What we really want to know is the best way to behave in MDPs. We need to find the best behaviour to follow in an MDP.

We don’t really care about how much reward we get following these 50/50 policies. What we care about is the best path that we can take in a system to find the optimal way to solve the problem. Let’s define the optimal-state value function v*(s) is the maximum value function over all the policies. We care about snatching the maximum possible reward and to do that, we need the best policy function.

Once you figure these values out, you’re done. If you went left and got +70 reward but going right would give you +80 reward, where would you go? Obviously right. So, we know that the MDP is solved when we’ve figured out the bets possible performance in the MDP.

Well, what is the optimal value here?

Remember that we’re subtracting at every time step i.e. 6 -(-2) = 8, 8 -(-2) = 10. The red text is exactly what v*(s) is → how good it is to be in each state. But it doesn’t tell you how to behave in it.

We can take the q*(s,a) values to figure out the optimal values for every arc:

We also need to figure out the optimal policy; we talked about the maximum reward we can get, not the policy itself. What does it mean for one policy to be better than another?

One policy is better than another thing if the value function of that policy is greater than the value function of another policy in all states.

There is one “ultra” policy that’s better than all the other policies. Not for just specifics, but for all s.

To find the optimal policy, you just need to solve for q* and then you pick the action that gets you the most q*.

Remember that the policy tells us the optimal math that we should follow for every situation. Let’s do this one last time with the student MDP:

There’s a special “version” of the Bellman equation known as the Bellman Optimality Equation that really tells us how to actually solve the MDP → how do you relate the optimal value function to itself:

The question will always be “what’s the optimal value of being in some state?” We can consider each of the actions to we might take [which take you to the black notes]. And then when we reach that action, we can look at its value (the number on the arc) and then simply take the maximum of the 2 values. Look at the value of each of the actions that you can take and simply pick the maximum q-value.

Now let’s do the other way around q→v. Now we don’t know what the state will be; the wind will have to take us there. Each of those states that we end up in has an optimal value and if we know the value of each of those states that we might end up in, we can average both of them now. No maximum here; we don’t get to pick where the wind blows us. That’ll tell us how good our action is.

Therefore, the optimal action value is the immediate reward + the average over all the probabilities of the wind blowing us left or the wind blowing us right multiplied by the optimal value of being in that state.

Let’s put those 2 pieces together:

Again, this equation relates v* to itself therefore it is recursive which can give us an equation we can solve. So, we’re looking ahead into the actions we can take and maximize over those while also looking over ahead at the dice the environment might roll [but we don’t control the dice therefore we can’t maximize] and instead, we average over the dice the environment throws at us. Now we look at the optimal value where we might end up and back these things all the way up to the initial state which tells us how good it even is to be in that state at the top here.

And now we do the same thing for q*:

If we want to figure out how good an action is, we first consider where the wind (environment) will take us and then average over the dice which the environment is rolling. Wherever the dice roll (the bottom row), we get to make a decision and maximize the decision we take. And for each of the black dot on the final row, we refer to the optimal action value → how good is it to be in this state and taking this particular action and trace it back up all the way to the beginning.

Back to our student MDP:

So we start off with a reward of 6. We look ahead at one step and figure out that the value of the current state (6) is equal to the maximum of the value function of where we end up. You can either follow the arc going up with a reward of -1+6=5 or continue studying and get a reward of 8–2=6. 5<6, therefore we continue studying.

The Bellman Optimality Equation is non-linear. There is no closed-form solution that you can follow. Instead, you need to use iterative solution methods i.e.

Value iteration
Policy iteration
Q-learning
Sarsa

Some quick extensions to MDPs (not expanding on this, just general info):

Infinite and continuous MDPs
Partially observable MDPs
Undiscounted, average reward MDPs

Dynamic Programming

Dynamic: dealing with sequential + temporal components to a problem.

Programming: using a problem i.e. the policy

Dynamic programming is all about finding the optimal program to solve a sequential complex problem. This is generally done by:

Breaking the problem down into subproblems
Combine solutions to solve the problem

Dynamic programming generally needs to have 2 properties:

Optimal substructure → you can solve some RL problem by breaking it down into ≥2 pieces, solving them, and then combining them to find the optimal problem.
Overlapping subproblems → subproblems can occur multiple times and by breaking the problem down into subproblems, there is something for us to gain.

The MDP satisfies both properties → the Bellman equation that we defined above gives us this way to break down the problem ex. “how can we break down the optimal reward function into 2 sub-pieces?” i.e. take one step to the left and then decide what step to take next.

Dynamic programming assumes full knowledge of the MDP and is used for planning in an MDP. There are 2 main cases:

Prediction → given an MDP of our [state, action, transition dynamics, reward, discount factor] and policy, we want to figure out the optimal value function.
Control → full optimization, given an MDP of our [state, action, transition dynamics, reward, discount factor] policy, we want to figure out the optimal policy function.

Policy evaluation

This is the case whenever someone gives you the MDP + policy, and given an action, you need to calculate the reward i.e. “if i continue walking straight ahead, how much reward will i get?”

What’s the problem? evaluate a given policy π

What’s the solution? turn the Bellman expectation into an interactive update system

Let v₁ → v₂ → … → v_π be the initial value function (what’s the value of all states in the MDP). Then, when we look one step ahead with the Bellman equation, we want to figure out a new value function v₂. This process is then iterated multiple times until we find the true value function.

This is done through something known as synchronous backups:

At each iteration k + 1:
For all states s ∈ S:
Update vₖ₊₁(s) from vₖ (s’) where s’ is a successor state of s

So, for every single iteration, we’re going to consider all the states in our MDP, we’re going to start with our previous value function and apply every single state in that value function so that we can produce a new value function of that state. TLDR; is that we run all the states sequentially through the initial value function and then update the value function over time to create new value functions.

What we’re going to do is take the Bellman equation we had before to update the value function (turn it into an iterative function). We plug in the leaves of the previous iteration, vₖ(s’) next iteration vₖ₊₁(s) and back up those values up to vₖ₊₁(s). And now we simply reiterate that process to find the optimal value function.

Let’s say that we have a 4x4 grid with 2 shaded squares in the top and bottom corner. 4 actions (N,E,S,W). Regardless of every action you take, you get a reward of -1 and once you land on the shaded squares, you get no more reward and the episode is over.

We’re optimizing for the shortest path to get to the corner. We can use dynamic programming to solve this.

Let vₖ be the random policy that we’re trying to optimize for. We’re going to start off assuming that every square will have no reward. We then apply the Bellman expectation equation as backup and then we pick up the new value of those states. If we pick one of the states i.e. the one highlighted in yellow, whatever direction that you go in you get a -1 reward but we also look at the value according to our previous estimate which says that we’d get 0 everywhere.

And now we plug in a new value here saying that “if we take a step up, we get a -1 step with a 0 reward.” And now we iterate to k=2.

Let’s do that same lookahead on the square highlighted in red. If we look N,E,S,W (each with probability 0.25), each square we move has an immediate cost of -1 + the value where I end up. Therefore, the cost is -1–1 = -2. The value of my current state will also approximately =-2 given the average taken from looking at N,E,S,W.

Look at the top left square that has a value of -1.7. Remember that if we go north, we just end up in the same square we’re in so we’d just get an immediate cost of -1. Therefore, -1 (N) + -2 (E) + -2 (S) + -1 (W) = -7/4 = -1.75 (rounded up to -1.7).

After we keep iterating and find the optimal function…

The right hand side of the images is all about finding the optimal policy using a greedy policy with respect to our value function, vₖ.

Policy iteration

Given a policy π, how can you return a new policy that’s better than the current one that we have?

This can be broken down into 2 main steps:

Evaluate the policy π

2. Improve the policy by acting greedily with respect to v_π (look at actions and pick which one is best)

In the small grid world example that we have, the improved policy was optimal i.e. π’ = π* where we just followed the values based on the left hand side values. But the more we iterate over this 2 step process, we eventually find the optimal policy that converges to π*.

Let’s use the example of a car dealership:

states: 2 locations, max 20 cars
actions: move 5 cars between locations overnight
reward: $10 for each car rented
transitions: cars returned and requested randomly

What’s the optimal policy for shifting cars around each location to get the maximal reward (people get their car)?

The goal is to figure out what to do given a certain state i.e. move x cars to the other location. Immediately after 1 iteration, we notice that sometimes we should be shifting 5 cars to the other location by acting greedily. From there, we take another step and create a new value function V₄ and then look ahead to find the new policy and then reiterate until we find the optimal policy.

Consider a deterministic policy a = π(s). The way we improve the policy is by acting greedily → pick actions the way that maximizes the action value.

Remember that q_π is the immediate reward + the value function.

If we act greedily, this improves the value from any state s over 1 step (looking at only one step for now):

If we’re taking the max of q_π, this must mean that atleast ≥ one particular q_π which will also improve the value function v_π’(s) ≥ v_π(s) — proof:

The moral of the story: If we pick some greedy policy, the value of the greedy policy (the total of the amount of reward we get by acting greedily) is at least as much as the policy that we had before it. What does this mean? Acting greedily will always never make things worse.

But if improvements stop if we have an equality — we want to make sure that when we hit equality we’ve reached the optimal equation:

which also means that the Bellman optimality equation has been satisfied:

Does policy evaluation always need to converge to v_π and just approximate instead? This is generally known as modified policy iteration.

There are 2 ways people generally deal with this:

Introduce a stopping condition → ϵ-convergence of value function
Simply stop after k iterations of iterative policy evaluation ex. k=3 was good enough to achieve the optimal policy

Value iteration in MDPs

Any optimal policy can be subdivided into 2 components:

Optimal first action A_*
Optimal policy from successor state S’

This brings us to the theorem of the Principle of Optimality:

If we generally know that the solution to subproblems v_*(s’), how do we use that information to build an optimal value function from the previous step? We can find the solution v∗(s) by using the one-step lookahead using the Bellman optimality equation:

Remember that the idea of value function is to make sure that we apply these updates iteratively. We start with the final reward and work backwards by looping over our entire state space.

Let’s create another grid world:

We’re trying to find the optimal (shortest) path to reach g. Generally, one you know that the top left is the goal, then the reward can just be done backwards i.e. the cost adjacent to g would be -1, -2 beside the states that are -1, and then you can simply propagate through.

In synchronous programming though, we don’t know in advance where the goal is. Instead, we need to update every single state of all 0’s, and then update by doing the 1 step lookahead and just working from there.

Summarizing Dynamic Programming Algorithms

Generally, all of these algorithms are based on some sort of state-value function i.e. v_π(s) or v_*(s). The complexity O(mn²) per iteration, for m actions and n states when we use synchronous evaluation. This could also apply to action-value function q_π(s, a) or q_∗(s, a) where the complexity is O(m²n²) per iteration.

Do we have to update every single state in our algorithm? Of course not! We can instead use asynchronous evaluation where we can just pick any state we want to be the roof of our backup, and then you simply backup on that state and move immediately. You just plug in the new value function at the bottom (remember our graph) and just work our way again. As long as you select every state sometimes, then your algorithm will still converge to the optimal value function with lower compute.

There are 3 main ways this is being approached:

In-place dynamic programming
Prioritize sweeping
Real-time dynamic programming

In-place dynamic programming is more of a programming trick but generally, synchronous value iteration stores to copies of the value function → v_old and v_new. In place is simple; just simply store only one copy of the value function and just keep the latest value.

We can update the states using Prioritized Sweeping. What this does is allow us to measure how important it is to update a certain state depending on how important we consider it to be. via the Bellman equation:

Real-time dynamic programming is mainly about selecting the relevant states that are useful to the agent by using real-world experiences to guide the selection of states.

Model-free Prediction

This is all about taking estimating the value function of an unknown Markov Decision Process and its dynamics + reward.

Monte Carlo RL (MC)

MC methods are straightforward; you’re just simply learning directly from the episodes of experience. What makes it model-free is that there isn’t any knowledge of MDP transitions/rewards. It learns from complete episodes by using the simple idea that value = mean (average) return. Caveat: can only apply MC to episodic MDPs and all episodes must terminate.

The goal is to learn v_π from episodes of experience under policy π where S₁,A₁,R₂,…Sₖ ~ π. Look at the stream of states, actions, and rewards we sample from our experiences. What we look at is the return of the total discounted reward where Gt = rₜ₊₁, γRₜ₊₂ + … + γᵀ⁻¹Rₜ.

From there, we calculate the expected return of the value function:

Instead of using the expected return, the Monte-Carlo policy evaluation uses the empirical mean return. The question is how do we do this when we can’t reset our state back to the point repeatedly and we need to figure this out for all states in our environment. How do we get the mean states from just trajectories. There are 2 ways we can do this:

First-Visit Monte-Carlo Policy Evaluation
Every-Visit Monte-Carlo Policy Evaluation

First-Visit Monte-Carlo Policy Evaluation

Imagine that there is some loop in the MDP where you come back to the same state repeatedly:

So, by using more and more random variables, the mean of those random variables approach the actual expectation of the value. Therefore, as we see enough episodes, the idea of Monte-Carlo Learning will really converge on the true example.

Every-Visit Monte-Carlo Policy Evaluation

Now, the main difference is that we consider every time step instead of the first time step:

Here’s an insightful post that better explains the difference between First and Every-Visit.

Let’s better understand this with blackjack:

Let’s see what this looks like with Every-Visit Monte-Carlo Learning:

Because the useable ace is rare, there’s initially a lot of noise (as indicated in the top-left graph) but over time, by trying to play blackjack and just learning by doing, the agent get’s the hang of it.

The value function depends on every single aspect of the environment; the dealer’s hand, the deck that you have, the amount of competitors. But we don’t tell the agent anything about this. It just learns over time…

note: instead of using mean, we can also use incremental mean rather than the mean to update as we see the sequence:

The new mean is the old mean + some set size towards the difference of the new element and the mean before it. We can implement this with Monte Carlo to get Incremental Monte-Carlo Updates.

By using α, we can introduce an exponential forgetting rate to make sure that we discard information in the past that we may not need.

Temporal-Difference Learning (TD)

Like Monte-Carlo, TD methods learn directly from episodes of experience + interaction of environment and is also model free i.e. no knowledge of MDP transitions/rewards are required. The difference is that TD learns from incomplete episodes by allowing the model to estimate. This process is known as bootstrapping. TD also updates the guess of the value function towards the guess.

We use the Simplest temporal-difference learning algorithm so that we update our value V(Sₜ) towards the actual estimated return after one step Rₜ₊₁ + γV(Sₜ₊₁) which breaks down into the immediate reward Rₜ₊₁ and the discounted value of the next step i.e. γV(Sₜ₊₁) (like Bellman equation).

We then simply substitute Gₜ by using the estimated reward instead of the real return. That is what the TD target is while we also have the TD error to check the difference.

Let’s use an example:

How do we update our value function based on this experience?

Notice that with Monte-Carlo, we update towards the actual outcome of 43 minutes, and update as you go through each part of the way. TD learning is different — you update based on every single step and then bounce around based on every step until you get to the end of the episode.

So then, what are the main differences? TLDR; Monte Carlo learns after the full episode, TD can learn after every single step.

Let’s also take a look at the bias/variance tradeoff:

Notice that the TD target (2nd jot note) has significantly less variance than the unbiased estimate (1st jot note). Why? Because Gₜ predicts infinitely while the TD target has a fixed range of predicting the reward one timestep into the future. There’s less noise. Monte Carlo needs more samples to learn. Generally,

Let’s look at an example of a random walk, simple MDP to better elaborate our goal. You only get a reward of 1 when you get to the right side. This is what applying TD would look like:

Initially, at the 0th episode, every state has an equal estimated value of 0.5. But through more and more training, the TD algorithm understands that moving to the right gets you higher reward and will therefore help you reach the end of episode.

Let’s look at the Root Mean Square error of Monte Carlo vs Temporal Difference:

The RMS(E) is averaged and then taken with respect to the ground truth for the actual estimated values (look above). With more step sizes, you can see that TD is substantially more efficient than MC.

Let’s say that we recorded only 3 episodes and fed them into MC and TD algorithms and keep iterating on those episodes for learning again and again. What would happen?

So initially, you start with state A and get a reward of 0. You then transition to state B but also get a reward of 0. But when you start with B, you get a reward of 1 and onward until the last step. What would the value function be?

That is what your MDP model might look like. This is the best explanation to understand the amount of data.

Monte Carlo always converges to a solution with the minimized mean-square error while Temporal Difference converges to a solution of the MDP that best explains the data. Note that N is the amount of times that we’re in a certain state and action while the reward is just the mean reward.

The TD exploits the Markov property which makes is more efficient because it actually makes use of it. MC though, does not exploit the Markov property which can make it less effective, but it performs well in non-Markov environments.

Let’s look at the Monte-Carlo backup:

We start with some state, we choose an action, and so on until that path gets terminated. So the question becomes “how can we figure out the value function of Sₜ?” What Monte-Carlo does is that it samples one trajectory here (highlighted in red) and then uses that sample to update the value function.

Temporal difference is quite different:

We sample the environment, state, action and then back up towards the value function to calculate the reward.

Dynamic programming also doesn’t look ahead like TD but we also don’t sample:

We have to know the dynamics to compute the full expectation but now we consider rₜ₊₁, take the full expectation there, and then sum those values up, weight their probability, take the maximum of their actions, and figure out the optimal path. You can also Dynamic programming all the way to the end and this is generally known as an exhaustive tree search.

Monte-Carlo samples the full environment while TD and DC bootstrap. This is what the unified view of RL algorithms would look like:

There are a spectrum of methods between the shallow backups and the deep backups. There’s a unifying algorithm that either end of the special case. This is known as the TD-λ algorithm.

We can define this to be the n-state prediction where we can look into as many states as we want. Remember that when n=∞, then we’re in Monte-Carlo Estimate and not TD-λ.

Which n is the best? We need to simulate and see the best n via testing. Let’s use the random walk example again:

Notice that as n approaches ∞, we get a very high error, just like MC. The goal is to figure out an algorithm that gets the best of all n at once. We need to average the n-step returns.

We don’t need to just take one n; we can use multiple and average over them and call that the target. How can we efficiently use multiple n and be able to properly average over them? We can use the λ return:

The λ-return is this constant that is a geometrically weighted average of all n. The idea is that λ is between 0 and 1 which tells us how much we’re going to decay the weighting of each successive n. It’s simply a normalizing factor. What does that weighting look like?

Geometric weightings are easily computable and are efficient. We don’t need to store more compute for every single return. We can also have it so that λ varies bu we don’t atke that into account.

Now, we want to update our value function towards that λ-return which means that our forward-view looks into the future will be so that we can compute G^λₜ.

So, the eligibility traces is an idea used to combine both the frequency heuristic and the recency heuristic together. Every time we visit that state, we increase the eligibility state. But then we decrease it over time (notice now eligibility state goes up based on the time of visit to a state).

Now, when we see an error, we update the value function in proportion to the eligibility trace; the proportion to how much critic was assigned to be in that state. This also brings us to the backward view TD algorithm:

What we do instead is that we update the value function for every state in proportion to that TD error and eligibility trace that we get. It’s the one-step TD error because we look only 1 step into the future. We broadcast it back to every state in the past and the value function of each of those states are updated in proportion to the TD error.

The other extreme is that when λ=1 where in that case, the credit we get is deferred all the way until the end of the episode. If we’re in an episodic environment, what’s really cool is that if you accumulate all the updates that you do in the backward-view, you actually end up getting the same sum of updates as the forward view. Therefore, they’re both equal.

Model-free Control

Everything that we’ve talked about above is “leading up” to this moment. Model-free control is all about answering the question: “If you dropped an agent into some unknown environment, how can it figure out the right thing to do i.e. maximize its reward?” → optimizing the value function.

There are 3 main ways we can address this problem:

On-Policy Monte-Carlo Control
On-Policy Temporal-Difference Learning
Off-Policy Learning

The main difference between on-policy and off-policy learning is that on-policy is about learning “on the job” while off-policy is more about learning while following someone else’s data.

On policy → learn about policy π from experience sampled from π

Off policy → learn about policy π from experience sampled from µ

When is Model-free control used? Generally, some problems that can be modelled as MDP include:

Elevators
Helicopters
Protein folding
etc.

For the majority of these problems, either:

The MDP is unknown, but we can sample experience
The MDP is known, but is too big to use, except by samples

Model-free control can solve these problems.

On-Policy Monte-Carlo Control

Is it possible to simply substitute the Monte-Carlo policy instead of the dynamic programming one to evaluate the probabilities? And then iterate from there until we reach the optimal policy?

There are 2 problems with this:

The only way you can act greedily for improvement over V(s) is by having an understanding of the MDP via the state value function. I can’t look one step ahead because I want to be model-free.
Exploration issue → the model does not think about sampling from other trajectories because it might not see a better path with faster reward

What we can do instead is the action-value function by enabling us to do control over a model-free setting.

What would that look like now?

Now we start with our q-value function (action-value function) and at every step, we do a Monte-Carlo policy evaluation (running a bunch of episodes) and then take the mean of each state-action pair to figure out how good each action pair is.

This gives us a new q function of which we then act greedily with respect to that function which gives us a new policy. We continue to run that policy again and again to get q_* and π_*.

But there’s still another issue again — we’re acting greedily = we might get stuck.

Let’s look at an example:

We need to figure out which door to go through. If you open the left door, get a reward of 0. But if you open the right door, you get a reward +1. If you’re acting greedily, then there’s only one logical thing to do — continue opening the right door. And now you get a reward of +3. So now the Monte-Carlo reward is +2 (mean of the reward i.e. 3+1=4/2=2). And now you just keep on opening this door, forever.

The problem is that we don’t know the value of the left door — you need to carry on and explore everything so that you understand the q value of all the other actions. How do you do that?

Use ϵ-greedy exploration → with probability ϵ, you just simply choose a random action. With probability 1-ϵ, choose a greedy action.

The guarantee here is that we keep exploring and that we get an improved policy.

ϵ-greedy really is an improvement. If you evaluate that policy, the question here is that can you be sure that by taking an ϵ-greedy step, can we be really sure that v_π’ is the better policy? The proof above says yes.

The idea of Monte-Carlo is to always act greedily with respect to the most recent estimate of the value function. After one episode, if you can already update the value function to something better, why keep the old estimate of the value function?

How can we really guarantee the optimal policy so that we actually reach q_, π_? There are 2 main things we need to balance:

Continue exploring
We want the best policy which might mean that we get to a policy where we’re not exploring anymore.

We can balance this by using Greedy in the Limit with Infinite Exploration (GLIE). The idea is that we can come up with any “schedule” for learned exploration such that 2 conditions are met:

We explore everything and all states + actions are visited often
The policy eventually converges on the greedy policy so that we can satisfy the Bellman equation.

The best way to do this is by having an exponentially decreasing ϵ:

What does that look like in practice?

If you let this look in an MDP, it will find the right solution. It’ll also be more efficient this way where you update the q value every single episode and immediately improve the policy vs. generating several batches of episodes.

On-Policy Temporal-Difference Learning

We’ve already seen that Temporal-Difference (TD) learning is better than Monte-Carlo (MC):

Lower variance
Online
Can deal with incomplete sequences

So naturally, it would also make sense to sue TD instead of MC in our control loop. How?

Apply TD to Q(S,A)
Use ϵ-greedy policy improvement
Update policy + Q function every time-step

This is called SARSA.

It’s pretty straightforward why it’s called SARSA. We start off with the state action pair as the black dot, we sample a new policy and get our reward R, and then sample our new state S’, sample the policy again of the next state, and we get the new action A’.

The equation is also intuitive. We’re going to “move” our Q value a little bit to the direction of our TD target (R+γQ(S’,A’)-Q(S,A)).

So at every single timestep, we’re going to move up and down. Each time we (agent) takes a step, we update our value function by applying one step of SARSA for the Q value, and then improve the policy greedily.

What does the algorithm look like?

Sarsa will converge to the optimal action-value function by using the GLIE policies while making sure that your step sizes follow the 2 conditions below:

The second point states that eventually, your q values become smaller and smaller and smaller until they become 0 or else there’s going to be noise. In practice, we don’t really care about the second point and emphasize the first point but it works.

Let’s look at an example of the Windy Gridworld:

The goal is to get from S→G but with every single step that we take, the wind pushes us up the number of grid cells mentioned on the bottom. What would happen if we used the naive version of SARSA?

Notice that it took 2000 time steps until SARSA finished an episode. Once if completes it, it goes up exponentially.

Can we get the best of both worlds i.e. unbiased behaviour from MC and control the bias-variance tradeoff of TD? Let’s use the n-step SARSA algorithm:

We look at the one-step return and then the Q value after one step (n=1) but we can continue doing that all the way down to ∞ which is essentially MC.

The Q return mentioned in the second point is the generalization to any n. The n-step Sarsa updates for each state-action pair, instead of updating towards the one-step estimate of our value function using Sarsa, we’re going to use the n-step return instead. This can be done for any n.

Let’s look at the λ version of this sampling:

This is known as Sarsa(λ). The idea is that we average over all of our n-step returns by weighing everything with respect to λ (take a look at the expressions under each node tree). This is then plugged into the update where we take the weighted sum and update the q value in a little bit direction in the target (q^λₜ).

The problem is that we’re looking forward in time, therefore making this an online algorithm.

Just like what we did with TD-λ, we’re doing to do the same with Sarsa(λ) by creating an eligibility trace:

Think of this as a way to quantify the amount of “blame” (critic) you should assign to every action you took in every state. The eligibility trace allows you to track + figure out “who is responsible for a certain state?” Generally, the action that we took most recently + the ones that we’ve repeated several times should be the ones blamed.

The second equation in the first point just states that “if I actually visited a certain state-action pair, then I’m going to increase the eligibility trace by 1.” The algorithm looks like this:

We initialize the eligibility trace to 0 for each episode and for each step for the episode, we take our actions using the policy, compute our TD error, increment the eligibility trace, and then update the Q function and the eligibility trace.

What does this look like? Let’s look at another gridworld:

The goal is to get to the tree. Let’s say that initially, you randomize all your state-action values to 0. What we’re going to do is indicate the value of those state-action pairs by an arrow with the direction of that state-action pair (3rd grid) with the size of the arrow indicating how big the q-value is.

The updates would look like this — if you did one-step Sarsa, when you hit the end goal, you get a reward of 1 in a grid initialized of all 0’s. And now you need to adjust the values of your trajectory to what actually happened. What changes? The end grid cell is what gets updated while all the values become propagated backwards by one step.

The λ parameter is what indicates how far you should be propagating backwards.

Off-Policy Learning

We discussed so far about on-policy learning where “the policy which I’m following is the policy that I’m learning.” But there are cases in which we want to evaluate some other policy while following one policy.

Why is this important?

Can learn from observing humans or other agents and figure out how to solve the MDP
Re-use experience generated from old policies π₁, π₂, …, πₜ₋₁
Learn about optimal policy while following exploratory policy
Learn about multiple policies while following one policy

There are 2 mechanisms that deal with this:

Important Sampling
Q-Learning

Importance sampling is all about estimating the expectation of a different distribution:

The expectation of your future reward is the sum of those probabilities multiplied by how much reward you got. We then just multiply and divide by some other distribution which can give us an expectation.

How is it used in Off-Policy Monte-Carlo?

We apply Importance sampling to the whole Monte-Carlo trajectory. The problem is that it just has extremely high variance because it samples throughout the whole episode which is why it isn’t used in practice.

This is why we have to use TD learning instead:

Because we’re bootstrapping after one step, it becomes a lot simpler because we only need to report for one sample. We update the value function in the direction of the TD target and correct our distribution over it.

But the ideal way to work with Off-Policy Learning is Q-Learning:

We’re trying to look at a specific cause where we can leverage our Q values and the action values to do off-policy learning in an efficient way that doesn’t need Importance Sampling.

The concept is straightforward; we select our next action using the behaviour policy while considering the alternative successor action had we been following our target policy. And then from there, we update the Q value based on the state we started in, the action we took and update towards the value of our alternative action.

In the case we’re trying to learn a greedy behaviour, then we now allow both the behaviour and the target policies to improve.

At every step, we make our target policy greedy with respect to our Q function while the behaviour policy will be ϵ-greedy with respect to the Q function as well. This is known as SARSAMAX:

We’re updating our Q values a little bit in the direction of the best possible next Q value that we can have after one step → like the Bellman Optimality Equation.

So, to recap, here are the relationships between DP and TD:

Value function approximation

Games have a lot of states ex. Backgammon has 10²⁰, Go had 10¹⁷⁰ states, but continuous state spaces such as controlling a car has an infinite number of states. You can’t create a table and just use that scale up. It isn’t gonna work.

What’s the goal here? How can we scale up the model-free methods for prediction and control from the last two lectures?

The problem with large MDPs is that there are way too many states and/or actions that we need to store in memory. What’s the solution? We can estimate the value function with function approximation.

What we’re going to do is consider the true value function v_π(s) and create a new value function that estimates this thing everywhere (v hat). w is just some weight (think of it as the weights of a neural network) that we feed into v hat to estimate the value function of the state space. The goal is to generalize from seen states to unseen states while parameter w is updated using MC or TD learning.

Think of this image as 3 different types of value approximation functions that you can sue. The idea is that if we’re trying to do state value approximation (v hat), we have the “black box” which takes in the state, the box that tries to map from the state using internal parameters w, and then outputs how good the state is.

For value function approximation, there are 2 main ways we can approach it: action in (2nd box), action out (3rd box). In the action in case, we input the state and action with the goal of the black to figure out given the input state and considered action, how good would that be? We then fit everything in the black box which returns us those probabilities with parameter w.

Action out is different. Given an input state, we want our approximation function to give us all the actions that I might take all in one forward pass. You get all the information you need from just a single pass.

What do approximation functions look like:

Linear combinations of features
Neural network
Decision tree
Nearest neighbour
Fourier / wavelet bases

We want to focus on differentiable approximation functions i.e. where we can update our parameter w because we know the gradient and how it’ll affect the output (the ones are bolded). The data that we’re dealing with is not supervised. We require a training method that is suitable for non-stationary data.

Gradient Descent

The gradient will tell us the direction of the steepest descent; our goal is to follow this slope as far down as we can get to the local minimum of J(w) and then adjust from there.

What’s the goal? to find the parameter vector w which minimizes the MSE between approximate value function v hat and the true value function.

Each state can be represented as a feature vector that can contain info about our states (location, weather, etc.). The goal is to find a linear mapping between the features:

Because our objective function is MSE, that means that our objective function is quadratic which makes it really easy to find the optimal parameter values.

We cheated using these methods a lot. There is no supervised method in RL. We need to figure out how to directly find experience. How do we do this? Monte-Carlo and Temporal-Difference! Why? We can get a target to use as our function approximator instead of having labelled data (oracle).

For Monte-Carlo Learning, we change our weights of the parameter vector w in the direction of error between the actual return and our function estimator multiplied by its gradient.

TD Learning uses the bootstrap instead. What we’re going to say is instead of waiting for all the way until the end of the episode to figure out the total reward, I’ll just simply take a step, estimate the reward, and use that as the “ground truth”. We use the TD estimation as labelled data to update the gradient of our parameter w. Same idea with TD-λ where we use λ as the training data.

The same idea for TD-learning will work as well:

Again, notice that our corresponding label is the TD target. Same idea with TD-λ.

What does control for the value approximation function look like? We’re still going to build on the idea of generalized policy iteration. But now we’re going to use approximate policy evaluation where we start off with the parameter vector w and act greedily with respect to our neural network to find a new value function and then update the policy.

We need to also approximate the value function. So, for each state in the action-value function, we’re trying to predict the expected reward and then minimize the MSE again to find the local minimum using SGD:

Now we need to represent our state and action as a linear combination of feature vectors:

We can use the same idea for Monte-Carlo Learning and Temporal-Difference:

Let’s look at the Mountain Car experiment:

The state space is 2D: we have the position of the environment and the velocity we’re going at. The more episodes + steps we go at, the more distinct shape emerges.

TD doesn’t follow the gradient of any sort of objective function which is exactly why it can diverge when off-policy or using non-linear function approximation. We can instead use the Gradient TD function because it follows the true gradient of projected Bellman error:

How does this compare against control algorithms?

Gradient descent is very simple to set up and is appealing to use. The problem? it isn’t sample efficient. When we see an experience once and take our update, we just simply adjust our function and then throw that experience away and move on. The problem? It isn’t data efficient — we haven’t found the best valuable fit to our value function. Batch methods seek to find the best fitting value function given the agent’s experience (“training data”).

What does it mean to find this best fit? We can use something like the least-squares algorithm to find a proper fit the value estimator to our true value function:

There’s actually a very easy way to find the least-squares method: experience replay. What we do is that instead of throwing away our data, we turn it into an actual dataset. At every timestep, we sample a state + value from that experience and do an SGD update on it (just like supervised learning) until we find the least squares solution:

Deep-Q-Networks (DQN) uses experience replay and fixed Q-targets:

Experience replay stabilizes the network because it decorrelates the trajectories because you shuffle + sample different experiences for better correlation fit. The other idea is to have 2 different Q-networks where we basically freeze the old network and bootstrap towards our targets.

What does that look like in Atari? We use end-to-end learning of the values Q(s,a) given the pixels (which are our state s) that are fed into a Convolutional Neural Network.

How did it do compared to humans?

Everything to the left of that grey threshold are instances when the RL agent did better than a human.

We have this method where we use several iterations to hit the optimal solution. But what if you don’t have many iterations to train with? With more computation/iteration, the more you can do in one step to get to the local minimum faster. What does that look like?

When we hit the minimum of LS(w) then that technically means we’ve found the optimal solution. To recap:

These have better convergence policies compared to the normal, incremental MD/TD.

We apply this by using the Least Squares Policy iteration algorithm where we evaluate the policy by least-squares Q-learning with a greedy policy improvement.

Policy gradients

Generally, we’ve focused on creating a policy from the value function ex. using ϵ-greedy. We actually parameterize the policy and not the value function i.e. we control the parameters of π which affects the distribution by which we’re picking actions (remember: this is model-free RL). We need to scale and find parameters that allow us to generally scale.

We need to better understand how we can actually take this parameterized policy (and its parameter vector u) and adjust its values to get more reward via. gradient ascent — if we follow the gradient, we move uphill in such a way that we improve our policy.

Sometimes, it’s more efficient to represent the values of the policy than the value function itself, we might also have situations where the value function could be complicated vs figuring out the action.

Main advantages:

Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies

Disadvantages:

Converges to local rather than global optimum (naive policies)
Evaluating a policy is inefficient and has high variance

How does all of this work?

If I always start in some state s₁ or if I have some sort of distribution over start state s₁, what’s the total reward that I’ll get from that state onwards?

The other way is the average value or we can also consider taking the average of my immediate rewards over the entire distributions of states that I visit.

How do we optimize all of these methods? Policy-based RL is an optimization problem. Our goal: find θ that maximizes J(θ). Generally, we can get greater efficiency when using gradients i.e. gradient descent, etc.

Let J(θ) be the policy objective function we’re trying to optimize for — the goal here is to search for a local minimum in J(θ) by ascending the gradient of its policy w.r.t the parameters θ:

Where α∇_θJ(θ) is the actual policy gradient and α is the step size parameter:

The overall update process:

Let’s assume that we want to compute the policy gradient analytically [assuming that the policy is differentiable when picking actions] and that the now the gradient of our policy. We can use the likelihood ratio trick; when we want to take the expectations of the gradient of the policy, we can use this to move in the direction to get more of something.

By rewriting the gradient in this way, we’re able to take expectations.

We can use the softmax policy to prioritize our policy in a discrete domain ex. for Atari, to know if we should go left vs. right, we can weigh the features of going left or right and whichever is more, we would pick that action:

The score function is very intuitive for the softmax function; it’s just the feature of the action that we actually took (left) subtracted by the average feature for all the actions that we might’ve taken.

Another policy that we can use is the Gaussian policy. We just generally parameterize the mean of the Gaussian and then add some noise to make it stochastic.

Let’s look at this for one-step MDPs:

If we want to increase our objective function (if we want more reward), we just need to move in the direction that’s determined by the score*immediate reward.

We want to do the same thing in multi-step MDPs and to do that, we need to replace the instantaneous reward (score*immediate reward) with the value function (long-term reward). This is defined in the policy gradient theorem:

If you start with some policy, then for any of the objective functions we considered early at the start state, the policy gradient of that is given by the equation in red (expectation of score function * value function Q).

Where do we use this? In the Monte-Carlo Policy Gradient algorithm (aka REINFORCE).

The idea here is to update the parameters with SGD. What we do is that we sample the expectation by using the return as an unbiased sample of Q. We plug that into the policy gradient to give us a direction (estimate) of direction to move. We then adjust our parameters a little bit in the direction of this score*return.

Let’s look at an example of a Puck World:

The idea is that we start with a puck and you need to adjust a little force on it every time step to find the target (pentagon) which moves every so now and then. We just run this and use the Monte-Carlo Policy gradient to update into the direction of those returns.

Notice that we get a really nice learning curve and the scale here (hundred million iterations). It’s slow.

Here’s the TLDR; of the main ideas:

How do we estimate the value function Q? We can use MC policy evaluation, TD learning, TD-λ, or even least-squares policy evaluation. Here’s an example:

Let’s say that we were using a linear value approximation function for the critic; we can figure out Q by just using a state and action * critic.

Remember that policy gradient update for actor = score * critic. For every step of the algorithm (not until end of episode), we sample the transition and see what the reward was, we see where we ended up after picking the previous step. From there we pick our own policy and get the TD error between the value before that step vs. the value after that step, and then update our critic in the direction of the TD error * features [making is linear].

We also update the actor in the direction that gest us more of the things the critic says are good.

Generally, whenever we approximate the policy gradient, we end up introducing bias to the model. The problem with that? Biased policy gradients might not find the right solution. But if we choose value function approximation carefully, then we can avoid introducing any bias.

How do we make all of this better? We have this “family” of actor critic methods where the actor picks the actions and then we have a policy that helps pick those actions. We have a critic that evaluates this and tells us whether a certain action was good or not. And then the actor moves in the direction suggested by the critic.

But how do we really make this better? We can use a trick known as the baseline. The idea here is to subtract some baseline function from the policy gradient in such a way that it doesn’t change the direction of percent (expectation).

What this all tells us is how much better than usual is a particular action a and then how can we adjust our policy so that we can achieve action a.

How do we estimate this function now with our critic? One way to do this is to learn both Q and V by having 2 sets of parameters. We can simply take the difference between those things as our estimate for the advantage function.

But the easiest [and most common] way of doing this is through this idea: the TD error is a sample of the advantage function:

If we use the true value function V^π, the expected TD error = expected reward + the discounted value at the next state — value of the state I was in. The expected reward + the state value at the next state is literally the definition of Q (refer to Bellman equation) which all defines the advantage function A^π.

What about using eligibility traces and different time scales to get a backward view? We can plug this idea into our actors:

We compute the TD error, we build up an eligibility trace, and then the eligibility trace now becomes an eligibility over our scores.

With actor-critic algorithms, there’s a really important question that really needs to be answered: how do we know really if the critic is pushing us in the right direction that we should be following? If you choose the value function approximation carefully, it would be possible to pick a value function approximator that doesn’t have any bias. This is known as Compatible Function Approximation.

Integrating Learning and Planning

We’ve talked a lot about learning a policy directly from experience and learning a value function directly from experience. But how can we actually have a model learn directly from experience and construct a value function or policy? The goal here is mainly to integrate learning and planning into a single architecture. This can be done with Model-Based Reinforcement Learning.

Well what’s the difference between that and Model-Free RL?

This is what Model-Free RL looked like:

This is what Model-Based RL looks like:

What’s the difference? We replace the real world with the agent’s perception of the world. We’ve replaced the real interactions (that the agent has with the world) and replace it with simulated interactions between the model and the environment.

As the agent interacts with the real world (experience), it then builds up its own model about its dynamics, you get a model which you then use to plan. The interactions with the model then create the value and policy which is then used for the model to act. Think of the model learning as a learning MDP and the planning is then we’re solving that MDP. And then you update via iteration.

Main Advantages:

can efficiently learn via supervised learning
can reason about model uncertainty

Disadvantages:

has to learn model, then value function

A model M can be a representation of an MDP and its parameters i.e. <S, A, P, R> which are all parameterized by η:

These are all one-step models ex. reward function would be “what would the reward be at the next time step given the previous state and action?”

What’s the goal here? estimate model M_η from experience i.e. {S₁, A₁, R₂, …, Sₜ}:

General examples of models:

Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
etc.

Table Lookup Model

In the case of a Table Lookup Model, the model will always be an explicit MDP P hat, R hat.

The transition dynamics is straightforward; take the average of how many times I ended up with s’ (state). Same thing with reward; it takes the average of all the times it was in a state and action given s and a. The alternative way is just to record every single experience that you get and just simply sample the model from there.

Let’s go back to the AB example:

What would our model planning look like?

Planning means to literally solve the MDP and find the optimal value function + policy. Remember that we said no to Dynamic Programming algorithms because you have to be told what the MDP is before you apply them. But what we’re saying is that you can to RL with them, all you need is to learn the model and then use that model to do Dynamic Programming.

We can use Sample-Based Planning. It’s straightforward — we use the model only to generate samples.

Going back to the AB example:

We start off with real experience, generate the model, and then use that model to sample experience. We’re extrapolating data (generation) that we can apply to MC learning and then retrain the model. Think of it like this: let’s say that the real experience that you get is from doing homework which then allows you to understand what you’re doing in class. Now, whenever you’re given another problem to solve, you already know how to solve it, you’re just training your mind (model) to become better and better. That’s exactly what’s going on here; we learn from data, create a model, sample from experience, and then retrain.

What happens if our model isn’t right? What happens if it really is inaccurate? We should never expect to get the right answer with RL. The performance of model-based RL is only as good as the optimal policy it is given.

Integrated Architectures

There are 2 main ways we can generate experience: real and simulated experience:

What would happen if we combine both real + simulated experience together? Enter Dyna-Q:

Learn a model from real experience
Learn and plan value function (and/or policy) from real and simulated experience

What does that look like?

Here’s the rundown of the algorithm:

We start off with action-value Q and some model. All that we’re going to do is plan with our table-lookup model in addition to our real experience. So, at every step that we take, we take a real action in the world, see where we end up, and then do 2 things:

usual Q update (Sarsa) towards our one-step lookahead
update our Model in a supervised fashion toward the reward + state that we observed after that one step. We have this in a loop.

And then we can sample n different states, actions, rewards and then “imagine that transition” to apply the Q-learning step to that transition (re-learning from past data). The more we sample different paths, the better the model.

How does Dyna-Q do in the grid world?

If I don’t do any planning (n=0), then then we “converge” after 5 episodes. But by setting n=50, we figure out the optimal policy within 5 episodes.

Simulation-Based Search

We’ve talked a lot about model-based RL. Let’s scope down to the planning problem and how we can use samples of imagined experience for good planning.

We can use Forward search since they’re generally focused on selecting the best action by lookahead. How? We use a search tree with the current state sₜ at the root and then use a model of the MDP to look ahead:

Simulation-Based search simulates those episodes of experience from now with the model by applying model-free RL to those episodes:

What does this look like?

Let’s start with the simplest possible version of all of this: Simple Monte-Carlo Search:

Instead of just evaluating the root actions i.e. going left vs right, we’re going to evaluate every state-action pair that we visit and build a search tree containing all the states and actions that we’ve visited so far and evaluate them.

But the real question is: how do we apply this evaluation method to simulation?

After every simulation, we’re going to improve it. We look at the Q values, maximize over them, and then make them better. The key distinction here though is that we don’t have a complete table of q-values everywhere and are only in the search tree. That’s why there are 2 phases (in-tree, out-of-tree). When we’re in-tree, then we maximize the policy, but when I’m out-of-tree, then I behave according to my simulation policy.

At every simulation, we evaluate the states via MC evaluation, and then improve the tree policy via ϵ-greedy. This is literally MC-Control for simulated experience.

Let’s look at Go (one of the hardest tasks for AI):

What would policy evaluation look like here?

We’re going to create a reward function and simply base it off who wins. We’re playing both sides (black and white) and therefore define policies for both.

MC Evaluation will look like this: we’re in a current state s and we want to know whether we’re going to win or not. We we role out some possibilities via simulation (4 different versions) and then collect the possible reward which is a good estimation.

This approach is very naive. Let’s turn this into a Monte-Carlo Tree Search algorithm instead.

Because black won that game, notice that we get a 1 for it. And now we fill in the value in our search tree (1/1). And then we use the tree policy to guide us as we make more simulation:

What’s the general advantages of MC Tree search?

Highly selective best-first search
Evaluates states dynamically (unlike e.g. DP) and not the entire state space (current position)
Uses sample-based planning methods to break curse of dimensionality
Works for “black-box” models (only requires samples)
Computationally efficient

Monte-Carlo is just one way of a search algorithm. What about Temporal-Difference Search and other model-free RL approaches? We do our simulation-based search again but use TD (bootstrapping) instead of MC and apply Sarsa to sub-MDP from now. Why?

Again, we simulate episodes based on our current (real) state sₜ with the goal being to estimate the action-value fiction Q(s, a).

Exploration vs Exploitation

What even is this Exploration vs Exploitation dilemma? Whenever we have online decision making, we’re left with 2 choices:

Exploitation → Make the best decision given current information
Exploration → Gather more information

I’ve stated this near the beginning of the article: the best long-term strategy might have short-term sacrifices. Why? So that you can gather enough information to make the best decisions.

There are 3 different approaches to this problem:

Random exploration → explore random actions
Optimism in the face of uncertainty → estimate uncertainties based on values and explore the states/actions with the highest uncertainty
Information state space → given the agent’s information of its state, use lookahead to see how that information can help the overall reward

Multi-Armed Bandit

It’s essentially a way to simplify the MDP network where we just have a tuple of <A, R> (throwing away state + dynamics matrix).

Think of each machine as a one-armed bandit — each arm you choose results in a different layout. Which one do you choose next? Remember that the criteria here is the maximize cumulative reward…

We also have something known as regret which is essentially the difference between the optimal value and the action that we chose:

Regret can be thought of as a function of these gaps and counts that exist:

Anytime a gap is large, you need to make sure that you pull that arm very few times; the goal is to always pick the arm with the smallest gap. What’s the problem? The gaps are unknown because we don’t know v*!

What does this regret look like over time?

For most naive algorithms, we just keep on adding regret linearly. Even when we act greedily because we might end up locking on the sub-optimal action. ϵ-greedy might not work as well because we end up exploring the whole time. We want something that has less and less regret given the more time steps of training. Is this possible?

Yes! We can use a Decaying ϵ-Greedy Algorithm instead where it decreases over time. What happens? We end up getting a logarithmic (as guessed from the graph) asymptotic total regret!

You can’t do the schedule in practice (because it requires knowledge of v*) but let’s say someone told us v*. We then use this information to figure out the size of all our gaps. Then, we can just simply invent a schedule to figure out the size of the smallest gap between the best action and the second best action.

But all algorithms have a lower bound meaning that it cannot perform worse than some sort of logarithmic threshold.

Optimism in the face of uncertainty

Imagine that there are 3 arms: the blue, red, and green arms. Each arm is showing the distribution over their q values. The x-axis is the corresponding reward for an action. Remember that the flatter the curve, the more uncertain we really are about picking a certain arm. Which arm should we pick next?

What does Optimism in the face of uncertainty say? Take the one that is the best (green), and take the one that has the potential to be the best (blue). And now, as you try each arm more and more and more, you narrow down the distribution to a specific reward.

How do we estimate this uncertainty though? We use the general idea of Upper Confidence Bounds (UCB). The idea is that I’m not just going to estimate the mean of a particular arm, but also some sort of confidence of where it might be. So for each distribution, we’re not only going to estimate the q value + some boundary to figure out the tail of the distribution.

We pick the action with the highest confidence (we know where the reward will be at). The less we try the error, the larger the confidence will be (uncertain) vs smaller confidence (scoped down to figure out a good range of where our rewards might be).

We apply something known as Hoeffding’s inequality to figure out the rewards of a bandit that has to select action a.

We want to figure out “what’s the probability that ‘”m wrong about my estimates of these q values more than my U value?”

From there, we pick a probability p and then solve for Uₜ(a):

What’s nice about this is that we don’t need to know anything about gaps or rewards and boundaries and now a way to pick the optimal action. And notice that as we pick things more and more and more (Nₜ(a)), then Uₜ(a) will eventually hit 0. We also schedule our p as we observe more and more to become more confident in the q value.

This brings us to the UCB1 algorithm: at every step, you estimate the q value by taking the Monte Carlo estimate (imperial mean of everything you’ve seen so far) and then add the bonus term based on the amount of times you’ve pulled an arm (t) divided by the number of times you’ve pulled a certain arm (Nₜ(a)). And then you simply just pick the highest action value.

How does it do in practice?

What’s interesting is that if you tune ϵ-greedy well, then it can do as well as UCB1. But if not, then it’s significantly worse.

Bayesian Bandits

What we can do with this approach is be able to exploit prior knowledge about rewards p[Rᵃ]

What the Bayesian approach does is (with its experience), update w i.e. the mean μ + variance σ² of each of the arms we could pull.

We compute the posterior distributions of what each distribution looks like after we get information. What’s interesting is that we can use that posterior information to guide our exploration via UCB and/or probability matching. If someone gave us useful + accurate prior information, then it would make more sense to actually use this Bayesian approach. But if it’s wrong, then you should just go ahead with the UCB approach.

The other way to make use of our probability distribution is to use Probability matching. Essentially, Probability matching selects some action a according to the probability that a is the optimal action.

What’s really nice about probability matching is that it’s built to do optimism in the face of uncertainty i.e. the more uncertainty about something, the higher the probability that it could be the maximum value. We do this through Thompson sampling [which is a way to do sample-based probability matching].

This is generally the oldest algorithm for bandits but it actually achieves the lower bound for this algorithm (works optimally well). The way it works is that at every step, you sample from your posterior (randomly sample from distribution). And then whichever arm distribution has the highest sampled value, we just go with that.

Information State Space

We’ve explored 2 types of algorithms so far: randomized exploration algorithms (ϵ-greedy) and UCB. Now let’s look into how we can quantify the value of information.

Exploration is useful in an agent because it gains information about its environment. But how can we really quantify the value of information? 2 main ways:

How much reward a decision-maker would be prepared to pay in order to have that information, prior to making a decision
Long-term reward after getting information — immediate reward

If we can quantify how much long-term reward we might get by picking an unknown arm, how much would that information be worth for me? Can we quantify that?

The best way to do this is to transform our Bandit problem back into an MDP (sequential problem) and use the information state s~ to summarize all the information that we’ve received so far.

But in order to solve Information State Space Bandits, we need to formulate the bandit as an infinite MDP over information states but it can be solved with Reinforcement Learning!

Model-free RL → Q-Learning
Bayesian model-based reinforcement learning → known as Bayes-adaptive RL where you find the Bayes-optimal exploration/exploitation trade-off with respect to prior distribution

The way Bayes-Adaptive Bernoulli Bandits works are that you have some sort of Beta prior. And then for each action a, we have the Beta prior telling us how good that action is. And then every time an action is selected, then we just simply update the Beta posterior.

We can solve Bayes-adaptive MDP can by using dynamic programming formally known as the Gittins index. But recent methods such as MC tree search works and can be used to play a lookahead tree to find the Bayes-optimal exploration-exploitation tradeoff.

To sum everything up so far, let’s just review the Multi-Armed Bandit Algorithms

Random exploration:

ϵ-greedy
Softmax
Gaussian noise

Optimism in the face of uncertainty:

Optimistic initialization
UCB
Thompson sampling

Information state space:

Gittins indices
Bayes-adaptive MDPs

Applying RL in SOTA Games

This is just a quick overview of some of the biggest games and how well RL agents have performed in them:

What does optimality in these games even look like? How do we really know if we’ve actually hit “superhuman” level? Remember that the optimal policy depends on how the other player plays. But we can use the idea of Nash equilibrium to define what optimality looks like:

If all the other players that you play against have a fixed policy π⁻ᶦ except for the ith player πᶦ, then you can create a “best response” against that set of policies. If you all are using the same strategy against me, then I can compute the best strategy against all of you. Nash equilibrium is defined when every single player plays the best response.

The best response will only be the solution to a single-agent RL problem where all the other players become part of the environment (we’re trying to operate in a way so that we can beat our opponents so we need to take them into consideration. In this case, the best response is optimal policy for this MDP.

Nash equilibrium in this case would be the fixed-point in the self-play RL where we can generate experience by playing games between agents. Each agent tries to learn the best response to other players where one player’s policy determines another player’s environment. We can then create a system where all the players adapt to each other.

There are generally multiple classes of games that we’ll focus on:

Perfect information games (fully observable by all players ex. chess)
Two-player games (player 1 = white, player 2 = black)
Zero-sum games where there are equal and opposite rewards for black and white (what’s good for black is bad for white)

How can we find the Nash equilibrium in Zero-sum games?

Game tree search (i.e. planning)
Self-play reinforcement learning

Minimax is straightforward; we maximize white’s expected return while minimizing black’s. ex. if black played the best possible mode for black and white played the best possible mode for white, what would the final outcome of the game be? The minimax pair is just a joint policy for black and white that achieves those minimax values while achieving a Nash equilibrium.

We can use a depth-first search to then figure out the optimal value from a given situation. The problem though is that search trees can grow exponentially and it would be extremely impractical to search until the end of the game.

What we can use instead is a value function approximation (or evaluation/heuristic function) where we use it to estimate the minimax value for each leaf node. You don’t need to search until the end of the game; we can use the value function approximator, get a value from that, and then propagate back upwards.

The simplest way these value function approximations were built was through something known as Binary-Linear Value Functions where we have a binary feature vector that contains information about which features are present in this position.

In this case here, the features present are the white rook, bishop, king, and then the black rook and king. We then multiply this by the actual point value for each piece and then take its product to figure out who’s winning (in this case it’s white with +3).

One of the most famous chess programs ever written was DeepBlue which was the first chess program to defeat a world champion (Kasparov):

It looked ahead 16–40 searches into the future and used that information to determine the next sequence of moves.

What would happen if we applied value-based RL algorithms to games of self-play i.e. chess + checkers to find a minimax solution? We can use Self-Play Temporal Difference Learning:

The idea again is simply to do something like gradient descent to update those parameters w (we’re assuming that all immediate rewards i.e. moving a piece is 0). One thing that’s useful to know especially in games is that you don’t need action-value functions. In deterministic MDPs (games), it’s actually just sufficient to have a state value function and use that instead.

Recently, computer scientists have been taking a radically different approach to game-tree search that has shown to be extremely effective in games + real-world domains known as simulation-based search. The idea is quite simple; self-play RL can actually replace search and we can simulate games of self-play from our root state Sₜ. We simulate experience that starts from now i.e. Sₜ.

MC Tree-Search has actually become one of the best performing methods in challenging games such as Go. But many games can be solved with just a simple MC search where you have randomized rollouts to the end of the game.

Let’s look at this in scrabble:

The way MCS would learn is by playing out the games and choosing what gives us the highest score. How did it go? It beat the world champion with such a simple policy:

What’s next?

Yay! Let’s assume that you’ve read through this whole article + went through Prof. Silver’s lectures. Now how do you really start building and implementing what you’ve really learned?

I would really recommend going through some of OpenAI’s Spinning Up documentation for people who are interested in Deep Reinforcement Learning. There’s a lot of valuable content surrounding that field + you get to code our several Deep RL algorithms!

This YouTube course is also a good resource that I’ve come across to get a good theoretical understanding of Deep Q Learning + Policy Gradients:

Another course by freecodecamp that I’d really recommend is to go through + understand how you can implement RL papers. It’s a great resource to go through and implementing algorithms + papers is something that’s really valuable to do.

Hey 👋 thanks so much for reading through the article! Hope you learned a thing or two about RL and got some sort of direction with it! My goal is to help as many people as I can and with that being said, I’d really appreciate if you could share with others and help them better understand Reinforcement Learning as they explore what they might be interested in!

Contact me

Connect with me on LinkedIn
Follow me on Twitter
Stay on the loop with what I’m building on Substack!