Everything you need to know about Reinforcement Learning in 80 minutes

A quick note!

  • There is no supervisor, only a reward signal
  • Feedback in RL is generally delayed, not instantaneous
  • Time matters: the agent makes sequential decisions in a dynamic system
  • The agent’s actions affect the subsequent data it receives (closed feedback loop)

Some example applications:

  • Flying stunt maneuvers in a helicopter
  • Defeating world champions in games, ex. Go and Atari
  • Investment portfolios
  • Walking robots

Defining the Reinforcement Learning problem


Understanding the agent + environment

At each time-step, the agent:

  • executes action Aₜ
  • receives observation Oₜ
  • receives scalar reward Rₜ

and the environment:

  • receives action Aₜ
  • emits observation Oₜ
  • emits scalar reward Rₜ
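As a rough sketch of this loop in code (using a hypothetical agent/env interface, not any particular library's API):

```python
# Minimal sketch of one episode of the agent-environment loop.
# `agent` and `env` are hypothetical stand-ins, not a real library's API.
def run_episode(agent, env, max_steps=1000):
    observation, reward, done = env.reset(), 0.0, False
    for t in range(max_steps):
        action = agent.act(observation, reward)       # agent executes action A_t
        observation, reward, done = env.step(action)  # environment emits next O, R
        if done:
            break
```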

History + state

  1. The agent selects actions
  2. The environment selects observations/rewards

Markov state
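The standard definition: a state Sₜ is Markov if and only if ℙ[Sₜ₊₁ | Sₜ] = ℙ[Sₜ₊₁ | S₁, …, Sₜ], i.e. the state captures all relevant information from the history, so the future is independent of the past given the present.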

Fully observable environments

Partially observable environments

An RL agent may include one or more of these components:
  • Policy: agent’s behaviour function
  • Value function: how good each state/action is
  • Model: how the agent represents the environment

Policy

Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = ℙ[Aₜ = a | Sₜ = s]

Value function

Model

  1. Transitions: the model predicts the next state of the environment ex. if I steer 5 degrees to the right, what will the camera input look like in the next frame?
  2. Rewards: given that transition, the model predicts the next immediate reward ex. what would the reward be if I steered 5 degrees to the right?

Categorizing RL agents

RL agents can be categorized by which of these components they use:

  • Value-based → value function, no explicit policy
  • Policy-based → explicit policy, no value function
  • Actor-critic → both a policy and a value function

and, independently, by whether they use a model:

  • Model-free → policy and/or value function, no model
  • Model-based → policy and/or value function, plus a model

Problems within Reinforcement Learning

There are two fundamental problems in sequential decision making. In the reinforcement learning problem:

  1. the environment is initially unknown
  2. the agent interacts with the environment
  3. the agent improves its policy

In the planning problem:

  1. the model of the environment is fully known (ex. we know the weather outside, the wind blowing against the self-driving car)
  2. the agent performs computations with this model
  3. the agent then improves its policy

Prediction and control

Markov Decision Process

  • optimal control (ex. steering a self-driving car) usually deals with continuous MDPs
  • partially observable problems can be converted into MDPs
  • bandits (expanded on later under exploration vs exploitation) are generally MDPs with one state
Sample episodes from the student Markov chain example (C1/C2/C3 = classes, FB = Facebook), starting from state C1:
  • C1 C2 C3 Pass Sleep
  • C1 FB FB C1 C2 Sleep
  • C1 C2 C3 Pub C2 C3 Pass Sleep

Markov Reward Process

The value function of an MRP can be decomposed into two parts:
  1. the immediate reward Rₜ₊₁
  2. discounted value of the successor state γv(Sₜ₊₁) → the value of everything from the next time-step onwards
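Putting these two pieces together gives the Bellman equation for the MRP value function: v(s) = 𝔼[Rₜ₊₁ + γ v(Sₜ₊₁) | Sₜ = s].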

Going back to MDPs…

Solution methods and extensions for MDPs include:
  • Value iteration
  • Policy iteration
  • Q-learning
  • Sarsa
  • Infinite and continuous MDPs
  • Partially observable MDPs
  • Undiscounted, average reward MDPs

Dynamic Programming

  1. Breaking the problem down into subproblems
  2. Combining the solutions to those subproblems to solve the full problem

Two properties make a problem suitable for dynamic programming:

  1. Optimal substructure → you can solve the problem by breaking it down into ≥2 pieces, solving them, and then combining them to find the optimal solution.
  2. Overlapping subproblems → the same subproblems occur multiple times, so by breaking the problem down into subproblems there is something for us to gain (their solutions can be reused).

In MDPs, dynamic programming is used for two tasks:

  1. Prediction → given an MDP of our [states, actions, transition dynamics, rewards, discount factor] and a policy π, we want to figure out the value function vπ of that policy.
  2. Control → full optimization: given an MDP of our [states, actions, transition dynamics, rewards, discount factor], we want to figure out the optimal value function and the optimal policy.

Policy evaluation

  1. At each iteration k + 1,
  2. for all states s ∈ S,
  3. update vₖ₊₁(s) from vₖ(s’), where s’ is a successor state of s.
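A minimal numpy sketch of this synchronous backup, assuming hypothetical arrays P[s, a, s’] (transition probabilities), R[s, a] (expected rewards) and pi[s, a] (policy probabilities):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, iters=100):
    """Iterative policy evaluation with synchronous backups.
    P[s, a, s'], R[s, a] and pi[s, a] are hypothetical numpy arrays."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        # v_{k+1}(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_{s'} P(s'|s,a) * v_k(s'))
        v = np.einsum('sa,sa->s', pi, R + gamma * P @ v)
    return v
```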

Policy iteration

Policy iteration alternates two steps:

  1. Evaluate the policy π (iterative policy evaluation)
  2. Improve the policy by acting greedily with respect to vπ
A classic example is Jack’s car rental problem:
  1. states: 2 locations, max 20 cars
  2. actions: move up to 5 cars between locations overnight
  3. reward: $10 for each car rented
  4. transitions: cars returned and requested randomly
Policy evaluation does not need to run all the way to convergence before each improvement step. Two ways to truncate it:

  1. Introduce a stopping condition → ϵ-convergence of the value function
  2. Simply stop after k iterations of iterative policy evaluation ex. k=3 may already be enough to reach the optimal policy
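Policy iteration itself is then a short loop on top of the policy_evaluation sketch above (same hypothetical P and R arrays):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Policy iteration sketch: evaluate the current policy, then act greedily.
    Reuses the policy_evaluation sketch above; P and R are hypothetical arrays."""
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions       # start from a uniform policy
    for _ in range(max_iters):
        v = policy_evaluation(P, R, pi, gamma)            # 1. evaluate pi
        q = R + gamma * P @ v                             # one-step lookahead Q(s, a)
        new_pi = np.eye(n_actions)[q.argmax(axis=1)]      # 2. act greedily w.r.t. v
        if np.array_equal(new_pi, pi):
            break                                         # policy is stable -> optimal
        pi = new_pi
    return pi, v
```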

Value iteration in MDPs

Value iteration builds on the principle of optimality: any optimal policy can be subdivided into

  1. an optimal first action A*, followed by
  2. an optimal policy from the successor state S’
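In code, value iteration just iterates the Bellman optimality backup (same hypothetical P and R arrays as above):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=100):
    """Value iteration sketch: v(s) <- max_a [R(s,a) + gamma * E[v(s')]].
    P[s, a, s'] and R[s, a] are hypothetical numpy arrays."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        v = (R + gamma * P @ v).max(axis=1)        # Bellman optimality backup
    greedy_pi = (R + gamma * P @ v).argmax(axis=1)  # read off the greedy policy
    return v, greedy_pi
```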

Summarizing Dynamic Programming Algorithms

Some extensions make these DP algorithms more efficient:

  1. In-place dynamic programming
  2. Prioritised sweeping
  3. Real-time dynamic programming

Model-free Prediction

Monte Carlo RL (MC)

  1. First-Visit Monte-Carlo Policy Evaluation
  2. Every-Visit Monte-Carlo Policy Evaluation

First-Visit Monte-Carlo Policy Evaluation

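A minimal sketch of first-visit MC evaluation, assuming a hypothetical list of pre-collected episodes; the only change for every-visit MC is noted in the comments:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte-Carlo policy evaluation sketch. `episodes` is a
    hypothetical list of trajectories, each a list of (state, reward) pairs,
    where `reward` is the reward received after leaving `state`."""
    returns_sum, visits = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, returns = 0.0, []
        for _, reward in reversed(episode):        # compute returns G_t backwards
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (state, _), G in zip(episode, returns):
            if state in seen:                      # count only the first visit;
                continue                           # every-visit MC drops this check
            seen.add(state)
            visits[state] += 1
            returns_sum[state] += G
    return {s: returns_sum[s] / visits[s] for s in visits}   # V(s) = mean return
```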

Every-Visit Monte-Carlo Policy Evaluation


Temporal-Difference Learning (TD)

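The core TD(0) idea is to update V(S) toward the one-step target R + γV(S’) (bootstrapping) instead of waiting for the full episode return. A minimal sketch with a hypothetical env/policy interface:

```python
from collections import defaultdict

def td0(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) policy evaluation sketch; `env` and `policy` are hypothetical,
    with env.reset() -> s and env.step(a) -> (s', r, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            td_target = r + gamma * V[s_next] * (not done)   # bootstrap from V(s')
            V[s] += alpha * (td_target - V[s])               # step toward the target
            s = s_next
    return V
```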

Model-free Control

This section covers three approaches:

  1. On-Policy Monte-Carlo Control
  2. On-Policy Temporal-Difference Learning
  3. Off-Policy Learning

Model-free control applies to many real problems, for example:

  • Elevators
  • Helicopters
  • Protein folding
  • etc.

It is the right tool whenever either:

  1. The MDP is unknown, but we can sample experience
  2. The MDP is known, but is too big to use directly, except by sampling

On-Policy Monte-Carlo Control

Two issues come up when turning Monte-Carlo evaluation into a control method:
  1. Acting greedily to improve over V(s) requires a model of the MDP to look one step ahead, which we don’t have in the model-free setting → so we act greedily over the action-value function Q(s, a) instead.
  2. Exploration issue → a purely greedy policy may never sample other trajectories, so it might never see a better path with faster reward.
We want two things that are in tension:

  1. Continue exploring
  2. End up with the best policy, which might mean we get to a policy where we’re not exploring anymore

GLIE (Greedy in the Limit with Infinite Exploration) reconciles these:

  1. We explore everything and all states + actions are visited infinitely often
  2. The policy eventually converges on the greedy policy so that we can satisfy the Bellman equation
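A common way to get both is ϵ-greedy exploration with a decaying schedule such as ϵₖ = 1/k (a GLIE schedule). A minimal sketch:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """epsilon-greedy selection over one state's action-values: with probability
    epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# GLIE schedule: decay exploration over episodes k, e.g. epsilon_k = 1 / k, so every
# action keeps being tried but the policy becomes greedy in the limit.
```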

On-Policy Temporal-Difference Learning

TD has several advantages over Monte-Carlo:

  1. Lower variance
  2. Online
  3. Can deal with incomplete sequences

So the natural idea (this is Sarsa) is to:

  1. Apply TD to Q(S,A)
  2. Use ϵ-greedy policy improvement
  3. Update policy + Q function every time-step
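A minimal Sarsa sketch, reusing the epsilon_greedy helper above and a hypothetical env interface:

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.99):
    """On-policy TD control (Sarsa) sketch; `env` is hypothetical."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                   # GLIE-style schedule
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], eps)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q[s_next], eps)
            # Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
            target = r + gamma * Q[s_next][a_next] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```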

Off-Policy Learning

Off-policy learning evaluates or improves one policy while following another. This is useful because we:

  1. Can learn from observing humans or other agents and figure out how to solve the MDP
  2. Can re-use experience generated from old policies π₁, π₂, …, πₜ₋₁
  3. Can learn about the optimal policy while following an exploratory policy
  4. Can learn about multiple policies while following one policy

The two main mechanisms are:

  1. Importance Sampling
  2. Q-Learning
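Q-learning behaves ϵ-greedily but bootstraps its target from the greedy action in the next state, which is what makes it off-policy. A minimal sketch, reusing the helpers from the Sarsa sketch:

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_actions, n_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD control (Q-learning) sketch; `env` is hypothetical."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], eps)               # exploratory behaviour policy
            s_next, r, done = env.step(a)
            # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```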

Value function approximation

For large MDPs there are far too many states to keep a separate value for each one in a table, so we approximate the value function with a parameterized function v̂(s, w) ≈ vπ(s). Possible function approximators include:
  • Linear combinations of features
  • Neural network
  • Decision tree
  • Nearest neighbour
  • Fourier / wavelet bases

Gradient Descent

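A minimal sketch of gradient-descent value function approximation with linear features (semi-gradient TD(0)); env, policy and the features function are hypothetical:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      n_episodes=1000, alpha=0.01, gamma=0.99):
    """Gradient-descent VFA sketch with linear features:
    v_hat(s, w) = features(s) @ w, so grad_w v_hat(s, w) = features(s).
    `env`, `policy` and `features(s) -> length-n_features array` are hypothetical."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            # TD target stands in for the true value v_pi(s); using the full
            # Monte-Carlo return G_t instead gives gradient MC evaluation.
            target = r + gamma * (features(s_next) @ w) * (not done)
            # Delta w = alpha * (target - v_hat(s, w)) * grad_w v_hat(s, w)
            w += alpha * (target - x @ w) * x
            s = s_next
    return w
```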

Policy gradients

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converges to a local rather than a global optimum (for naive policy-gradient methods)
  • Evaluating a policy is inefficient and has high variance
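The central result here is the policy gradient theorem: for a differentiable policy πθ, ∇θ J(θ) = 𝔼πθ[∇θ log πθ(s, a) · Qπθ(s, a)]. Monte-Carlo policy gradient (REINFORCE) simply plugs in the sampled return Gₜ as an unbiased estimate of Qπθ(s, a).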

Integrating Learning and Planning

Advantages of model-based RL:

  • can efficiently learn the model via supervised learning
  • can reason about model uncertainty

Disadvantage:

  • has to learn a model first and then a value function → two sources of approximation error
Examples of model classes:
  • Table Lookup Model
  • Linear Expectation Model
  • Linear Gaussian Model
  • Gaussian Process Model
  • Deep Belief Network Model
  • etc.

Table Lookup Model

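A minimal table-lookup model sketch: count observed transitions and average observed rewards, then sample from those counts when simulating.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Table-lookup model sketch: P_hat(s'|s,a) = N(s,a,s') / N(s,a) and
    R_hat(s,a) = mean observed reward for (s, a)."""
    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, s, a, r, s_next):
        """Record one real transition (supervised learning of the model)."""
        self.transitions[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        """Sample a simulated transition (s', r_hat) from the learned model."""
        successors, counts = zip(*self.transitions[(s, a)].items())
        s_next = random.choices(successors, weights=counts)[0]
        return s_next, self.reward_sum[(s, a)] / self.visits[(s, a)]
```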

Integrated Architectures

The Dyna architecture does both:
  • Learn a model from real experience
  • Learn and plan value function (and/or policy) from real and simulated experience
Each real step of experience then drives two updates:
  1. the usual Q update (Sarsa) towards our one-step lookahead target
  2. a supervised update of our model towards the reward + state that we observed after that one step. This runs in a loop (sketched below).
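A minimal Dyna-Q-style sketch of that loop, reusing the helpers above (hypothetical env; the one-step target is written in the Q-learning form for simplicity):

```python
import random
from collections import defaultdict
import numpy as np

def dyna_q(env, n_actions, n_steps=10000, n_planning=10,
           alpha=0.1, gamma=0.99, eps=0.1):
    """Dyna-style sketch: each real step does (1) a direct Q update and (2) a
    supervised model update, then replays a few simulated transitions from the
    model as extra planning updates. Reuses epsilon_greedy from above."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    model = {}                                            # (s, a) -> (r, s')
    s = env.reset()
    for _ in range(n_steps):
        a = epsilon_greedy(Q[s], eps)
        s_next, r, done = env.step(a)
        Q[s][a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s][a])
        model[(s, a)] = (r, s_next)                       # learn the model
        for _ in range(n_planning):                       # plan with simulated experience
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            Q[ps][pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps][pa])
        s = env.reset() if done else s_next
    return Q
```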

Simulation-Based Search

Simulation-based search (e.g. Monte-Carlo Tree Search) has several attractive properties:
  • Highly selective best-first search
  • Evaluates states dynamically from the current position (unlike e.g. DP, which sweeps the entire state space)
  • Uses sample-based planning methods to break curse of dimensionality
  • Works for “black-box” models (only requires samples)
  • Computationally efficient
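A minimal sketch of the simplest version, simple Monte-Carlo search (no tree): for each candidate action, roll out a learned model (such as the table-lookup model above) with a hypothetical rollout policy and act greedily on the mean simulated return.

```python
import numpy as np

def simple_mc_search(model, state, actions, rollout_policy,
                     n_sims=100, depth=20, gamma=0.99):
    """Simulation-based search sketch. `model.sample(s, a) -> (s', r)` and
    `rollout_policy(s)` are hypothetical."""
    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            s, r = model.sample(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)
        return ret

    mean_returns = {a: np.mean([rollout(state, a) for _ in range(n_sims)])
                    for a in actions}
    return max(mean_returns, key=mean_returns.get)   # greedy over simulated values
```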

Exploration vs Exploitation

  1. Exploitation → Make the best decision given current information
  2. Exploration → Gather more information
There are three broad approaches to exploration:

  1. Random exploration → explore random actions
  2. Optimism in the face of uncertainty → estimate the uncertainty on each value and explore the states/actions with the highest uncertainty
  3. Information state space → given the agent’s information about its state, use lookahead to see how gathering more information can help the overall reward

Multi-Armed Bandit

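A minimal sketch of the bandit setting (an MDP with one state) using ϵ-greedy exploration and incremental-mean value estimates; pull_arm is a hypothetical environment:

```python
import numpy as np

def run_bandit(pull_arm, n_arms, n_steps=1000, eps=0.1):
    """Multi-armed bandit sketch; `pull_arm(a) -> reward` is hypothetical."""
    Q = np.zeros(n_arms)                  # estimated value of each arm
    N = np.zeros(n_arms)                  # number of pulls of each arm
    for _ in range(n_steps):
        explore = np.random.rand() < eps
        a = np.random.randint(n_arms) if explore else int(np.argmax(Q))
        r = pull_arm(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # Q(a) <- Q(a) + (1/N(a)) * (r - Q(a))
    return Q
```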

Optimism in the face of uncertainty

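A minimal UCB-style sketch: add an exploration bonus that is large for rarely-tried (uncertain) arms and shrinks as an arm is pulled more often.

```python
import numpy as np

def ucb1_action(Q, N, t, c=2.0):
    """Optimism in the face of uncertainty: pick the arm maximising
    Q(a) + sqrt(c * log(t) / N(a)). Q and N are per-arm mean rewards and pull
    counts after t total pulls."""
    N = np.asarray(N, dtype=float)
    if (N == 0).any():
        return int(np.argmin(N))                  # try every arm at least once
    bonus = np.sqrt(c * np.log(t) / N)
    return int(np.argmax(np.asarray(Q) + bonus))
```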

Bayesian Bandits

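A minimal Thompson sampling sketch for Bernoulli rewards, the classic Bayesian bandit approach (it also appears in the summary list below):

```python
import numpy as np

def thompson_action(successes, failures):
    """Bayesian bandit sketch: keep a Beta(1 + wins, 1 + losses) posterior per arm,
    sample one value from each posterior, and play the arm with the largest sample."""
    samples = np.random.beta(1 + np.asarray(successes), 1 + np.asarray(failures))
    return int(np.argmax(samples))

# After pulling arm a and observing reward r in {0, 1}:
#   successes[a] += r; failures[a] += 1 - r
```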

Information State Space

  1. The value of information is how much reward a decision-maker would be prepared to pay in order to have that information, prior to making a decision
  2. In other words: value of information = (long-term reward after getting the information) − (immediate reward)
The resulting (augmented) information-state MDP can be solved with:
  • Model-free RL → Q-Learning
  • Bayesian model-based reinforcement learning → known as Bayes-adaptive RL where you find the Bayes-optimal exploration/exploitation trade-off with respect to prior distribution
To summarize, the main approaches to the exploration/exploitation trade-off are:
  • ϵ-greedy
  • Softmax
  • Gaussian noise
  • Optimistic initialization
  • UCB
  • Thompson sampling
  • Gittins indices
  • Bayes-adaptive MDPs

Applying RL in SOTA Games

Classic board games are a natural testbed for RL because they are typically:
  1. Perfect information games (fully observable by all players ex. chess)
  2. Two-player games (player 1 = white, player 2 = black)
  3. Zero-sum games where there are equal and opposite rewards for black and white (what’s good for black is bad for white)
The two key ingredients in state-of-the-art game-playing programs are:

  1. Game tree search (i.e. planning)
  2. Self-play reinforcement learning
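A minimal sketch of the game-tree-search half: plain minimax for a two-player zero-sum game (the game interface is hypothetical).

```python
def minimax(state, game, maximizing=True):
    """Game-tree search sketch (minimax). `game` is a hypothetical interface with
    is_terminal(s), value(s) from player 1's perspective, legal_moves(s) and
    next_state(s, move)."""
    if game.is_terminal(state):
        return game.value(state)
    child_values = [minimax(game.next_state(state, m), game, not maximizing)
                    for m in game.legal_moves(state)]
    return max(child_values) if maximizing else min(child_values)
```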

What’s next?

Contact me
