Everything you need to know about Self-Driving Cars in <30 minutes

Sri Anumakonda
27 min read · Jun 13, 2024


A quick note!

I’ve spent the past week learning what it means to build self-driving cars that create value, while also reviewing their technical intricacies. This article is a version of the notes I took while going through Andreas Geiger’s autonomous vehicle lectures. It will also serve as a reference for beginners who want to learn more about how self-driving cars work!

The history of autonomous vehicles, jot-note style

  • 1925: Houdina creates the “American Wonder” which is remotely controlled by another operator in a different vehicle
  • 1960: RCA Labs creates a wire-controlled car that allows the car to move on its own through electronic wires in the ground
  • 1970: the Citroen DS19 is able to drive up to speeds of 130 km/h via magnetic fields in the ground
  • 1986: CMU’s Navlab creates its initial demonstration of vision-based autonomy without any sort of infrastructure/radio communication
  • 1986: VaMoRs is demonstrated as the first autonomous vehicle that’s able to drive both autonomously longitudinally and laterally with speeds up to 36 km/h
  • 1988: ALVINN is a neural-network-based approach to driving on 30x32 input images
  • 1995: “Navlab 5” is able to go all across America (2850 miles) with 98% lateral autonomy
  • 1995: AURORA is created as a way to lane-keep by looking at line markings on the ground via a downward camera
  • 1995: Adaptive Cruise Control is created and deployed into vehicles, which creates level 1 autonomy
  • 2004: the first DARPA Grand Challenge occurs in the Mojave Desert. None of the vehicles completes the race, with CMU travelling 11.78 km of the 240 km route (the furthest distance). GPS is also used.
  • 2005: Stanford wins the DARPA Grand Challenge 2 with 5 teams finishing the course
  • 2006: High-resolution LiDARs are created that have 360° vision
  • 2007: DARPA Urban Challenge occurs where traffic and obstacles are put into place. CMU wins the race, while Stanford comes second.
  • 2009: Google’s Self-Driving Car project is started at Google X (which became Waymo in 2016)
  • 2012: Deep Learning methods are introduced into autonomous vehicles i.e. depth estimation, pose estimation, etc.
  • 2015: Uber starts their self-driving car division (which then gets shut down in 2020)
  • 2015: Tesla Model S introduces its Autopilot
Source

The main approaches to self-driving

Modular Pipeline

Source

The main idea behind the modular approach to autonomous vehicles is that you can break the fundamental problem down into several sub-parts (problems) that you can solve independently and then put together. The main problems they’re usually broken into are perception, localization, planning, and control.

Perception + scene parsing is all about detecting vehicles and understanding all the objects that are in your environment. This would include objects such as people and other vehicles, buildings, lane markings, trees, etc. This is mainly done through computer vision and LiDAR-based methods. Techniques such as object detection (bounding boxes), semantic segmentation, depth estimation, optical flow, and 3D map reconstruction are used to parse the scene and recreate the environment the car is in.

Notice how in the gif we’re able to place bounding boxes on objects, create an estimation of distances between the car and other objects, and determine the area we can drive on! Source

If you’re using a LiDAR with cameras, then you would generally perform sensor fusion here, where you would combine both your LiDAR representation and camera representation into one big 3D map:

Source

The second step is localization. The fundamental problem behind localization is that GPS data isn’t fully accurate; autonomous vehicles need to be accurate to the centimetre level, while GPS is only accurate to the metre level. Although GPS can be used for a higher-level understanding of the route that needs to be taken, autonomous vehicles need to know exactly where they are in order to make better decisions (ex. overtaking, slowing down, etc.).

The way localization is approached is through a series of measurement and update steps: you “measure” information in the real world (ex. vision and LiDAR data, along with any prior information you’re given), and then update your possible location from a uniform distribution to the “potential” specific location the vehicle is at.

Source
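To make the measure/update cycle concrete, here’s a minimal 1D histogram (Bayes) filter sketch in Python. The circular road, the landmark map, the sensor model, and the function names (`motion_update`, `measurement_update`) are all illustrative assumptions of mine, not any specific production localizer:

```python
import numpy as np

# The car lives on a circular road of N discrete cells, starts with a uniform
# belief, and repeatedly (1) shifts its belief when it moves and (2) sharpens
# it when it senses a landmark. Everything here is a toy illustration.

N = 10
world = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # 1 = landmark visible at that cell
belief = np.ones(N) / N                            # start fully uncertain (uniform)

def motion_update(belief, step, p_correct=0.8):
    """Shift the belief by `step` cells, leaking some probability for motion noise."""
    moved = np.roll(belief, step)
    return p_correct * moved + (1 - p_correct) * belief

def measurement_update(belief, z, p_hit=0.9, p_miss=0.1):
    """Reweight the belief by how well each cell explains the measurement z."""
    likelihood = np.where(world == z, p_hit, p_miss)
    posterior = belief * likelihood
    return posterior / posterior.sum()

for z in [1, 0, 0, 1]:              # a made-up sequence of sensor readings
    belief = motion_update(belief, step=1)
    belief = measurement_update(belief, z)

print(np.argmax(belief), belief.round(2))   # most likely cell + full belief
```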

The third step is planning, which is all about determining the optimal trajectory the vehicle needs to take. The goal here is that given the higher-level route information (GPS), you want to be able to create the trajectory/plan that’ll allow you to get there.

Control is then all about executing the planning stage, i.e. steering, brake, and gas. Based on the trajectory created, the autonomous vehicle should adjust its steering + speed accordingly.

Source

The biggest advantage of modular systems is that they’re interpretable. Because the fundamental problem of autonomous navigation is broken down into intermediate steps that humans can understand, it’s easier to debug systematic issues + deployment failures.

The problems, though, lie in the fact that there is no “joint training” across modules. Different modules are trained on different objectives and different data; none of the modules thinks about solving self-driving as a whole, but instead solves a specific aspect of it (ex. chasing 100% accuracy on object detection when we might not want to waste compute thinking about parked cars).

Another major problem with the modular approach is its reliance on pre-mapped environments and HD maps. Whenever you localize through a modular approach, you need to already have an HD map of the environment you’re driving through: the idea of being able to drop the autonomous vehicle into a new place and have it “magically” work doesn’t exist. Scalability is restricted to the geographical areas that have been pre-mapped with LiDARs and camera sensors.

End2End Learning

Source

End2end learning (aka imitation learning) is almost the opposite of the modular approach: there are no modules that break the task of driving up into separate sub-tasks. Instead, sensory inputs go through a neural network (essentially a black box) that directly outputs the controls needed to drive the car (steering, brake, gas). This not only deals with the “joint training” problem of the modular approach, but advances in localization also allow these vehicles to be dropped anywhere in the world and operate.

The biggest advantage of end-to-end learning is how easy it is to train and deploy these models; the only things you need are the sensory inputs (cameras and/or LiDAR) and the output controls. That’s all fed into the neural network, which learns directly from this data.

The fundamental problem lies in interpretability. Debugging end2end models and understanding why they made the decisions they did is extremely difficult, as there’s no obvious point of entry into the network’s reasoning. Interpretability remains an open problem, and there hasn’t been monumental progress in this area yet.

The goal of end2end learning is to “imitate” what the human driver is doing, i.e. create some policy that best mimics the actions in the data. Instead of hard-coding policies + breaking the fundamental problem of driving down into several sub-tasks, we use a data-driven approach where the only restriction on the policy’s performance is the amount of data it’s given.

Source

Prof. Geiger defines the formal definition of imitation learning as the following:

Source

Imitation learning is all about creating some sort of policy π that takes in information about the current state s and figures out the best action a, while minimizing the loss function 𝔏. The policy is then “rolled out” over n frames, where you iteratively update your states (modelled through a probability distribution) and actions (by repeatedly feeding sᵢ into π).
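Here’s a minimal behaviour-cloning sketch of that idea: a small policy network regressed onto expert actions with an MSE loss. The state/action dimensions and the random “expert” dataset are placeholders I made up; real training data would be logged driving frames and controls:

```python
import torch
import torch.nn as nn

# Behaviour cloning: learn a policy pi(s) -> a by regressing onto expert actions.
# The random dataset below stands in for real logged (state, action) pairs.

state_dim, action_dim = 16, 2            # e.g. compressed scene features -> [steer, throttle]
states = torch.randn(1024, state_dim)     # stand-in for recorded states s_i
actions = torch.randn(1024, action_dim)   # stand-in for expert actions a_i

policy = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, action_dim),
)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    pred = policy(states)                         # a_hat = pi(s)
    loss = nn.functional.mse_loss(pred, actions)  # the loss L between pi(s) and a
    optim.zero_grad()
    loss.backward()
    optim.step()

# At test time the policy is "rolled out": each predicted action changes the next
# state, which is fed back into the policy frame after frame.
```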

Conditional Imitation Learning

Imitation learning alone isn’t enough. You also need to incorporate higher-level route information rather than just predicting the next frame (i.e. get general directions so that you can decide whether to plan to move left, right, etc.). The goal behind conditional imitation learning is to be able to take in a directional command (ex. left, right, straight) along with the current state-environment observation into the end2end model.

Source

Most companies that use end2end learning as a way to solve self-driving will generally follow the conditional imitation learning route (wayve, comma). This is mainly because it’s the easiest way to incorporate navigational information into the model, allowing the autonomous vehicle to function in the real world.

Source

There are generally two main approaches to “command-conditional imitation learning”: you can have an image encoder and concatenate state-environment observations with your command, or you can instead create three possible state-action pairs given a command and then execute based on the input command (i.e. compute the hypothetical actions for going left, right, or straight, and then choose the one matching the command).
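Here’s a sketch of the second, “branched” flavour: one shared image encoder plus one action head per command, where the command only selects which head’s output is used. The architecture sizes and layer choices below are illustrative assumptions of mine, not taken from any specific paper or product:

```python
import torch
import torch.nn as nn

# Branched command-conditional imitation learning: shared encoder, one head per
# command (left / straight / right). The command picks the head at run time.

class BranchedPolicy(nn.Module):
    def __init__(self, feat_dim=128, action_dim=2, n_commands=3):
        super().__init__()
        self.encoder = nn.Sequential(              # stand-in for a CNN image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, action_dim) for _ in range(n_commands)]
        )

    def forward(self, image, command):
        feats = self.encoder(image)                               # shared perception features
        all_actions = torch.stack([h(feats) for h in self.heads], dim=1)
        return all_actions[torch.arange(image.shape[0]), command]  # pick one head per sample

policy = BranchedPolicy()
img = torch.randn(4, 3, 96, 96)           # dummy batch of camera frames
cmd = torch.tensor([0, 1, 2, 1])           # 0 = left, 1 = straight, 2 = right
print(policy(img, cmd).shape)              # -> torch.Size([4, 2])
```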

Direct Perception

The “goldilocks zone” between the modular + end2end approach is exactly what Direct Perception is all about; instead of having one big network that goes from perception → control, you instead have a network that goes from perception → low-dimensional intermediate representations, and then a planning + control module.

Source

This paper from the Princeton Vision Group goes deeper into how direct perception works, along with a proposed driving approach focused on lane-keeping.

Conditional Affordance Learning

Conditional Affordance Learning is what allows us to scale Direct Perception techniques to inter-city driving.

Source

After you take in video + directional input, you’ll use a neural network to predict affordances (distances such as the distance to the centerline and other measurements relative to the lane the ego-vehicle is in) and then use those parameters as inputs to the controller. The only difference here is that directional input is added as an input to the neural network.

Visual Abstractions

Sometimes, intermediate representations of our driving scene can actually allow for better end2end performance of autonomous vehicles. A couple of folks from UT Austin experimented with adding intermediate representations to end2end models and found that those representations resulted in better performance than pure end2end learning.

What do we define as a good visual abstraction, though? Prof. Geiger breaks it down into 4 main categories:

  • invariant: irrelevant noise from multiple images (ex. colors of cars) should be eliminated
  • universal: the abstraction should be able to explain the entire environment/driving scene
  • data efficient with respect to computational power
  • label efficient: minimal labelling effort should be required

Intermediaries such as semantic segmentation, depth, and optical flow have been mainly used in applications of conditional imitation learning as a way to help the model better understand the environment that it’s in.

Reinforcement Learning

The idea of Reinforcement Learning (RL) in autonomous vehicles has been experimented with for decades now. Instead of having to create extensive datasets and labels to train a policy, you can instead have it directly interact with its environment and allow the policy to become stronger over time as it interacts more and more with the environment. I’ll go over the fundamental ideas behind RL in this section (MDPs, Bellman optimality, and Q-learning). If you’re curious to learn more about how RL works, I wrote an 80-minute article going deep into the science of RL.

This is what RL generally looks like:

Source

Markov Decision Process (MDPs)

The fundamental goal of all RL agents is to maximize total future reward, where the agent might need to make short-term sacrifices in order to maximize reward in the long-term. We formalize the Reinforcement Learning problem through an MDP:

Source

All MDPs must obey the Markov property, which is the idea that the next state depends only on the current state (all earlier states are irrelevant once you know the current one). This then brings us to the RL loop:

Source

Actions are chosen through our policy, π. There are two types of policies: deterministic policies (a direct state → action mapping) and stochastic policies, which model the uncertainty of the environment in the policy itself. Good actions are discovered through exploration (a big reason why RL isn’t deployed in the real world and is mainly used in simulation). Balancing exploration (trying a new action) and exploitation (sticking with known good actions) is what allows the optimal policy to be found.

This balance is usually achieved through an ϵ-greedy algorithm: with probability ϵ you choose an action at random, and with probability 1-ϵ you choose the best known action. ϵ starts off as a large value and becomes smaller over time.
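A tiny sketch of ϵ-greedy action selection with decay; the Q-values, decay rate, and floor value are made-up placeholders:

```python
import numpy as np

# With probability eps we explore (random action), otherwise we exploit the
# current Q estimates. eps decays from ~1.0 towards a small floor over training.

def epsilon_greedy(q_values, eps, rng=np.random.default_rng()):
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best known action

eps = 1.0
for step in range(5):
    action = epsilon_greedy(np.array([0.1, 0.5, -0.2]), eps)
    eps = max(0.05, eps * 0.99)                   # decay epsilon over time
    print(step, action, round(eps, 3))
```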

Value Functions

The idea behind value functions is that you want to be able to calculate the expected cumulative (total) reward from your current state under the chosen policy.

Source

Action-Value Functions

Action-value functions are slightly different; instead of calculating the return based only on the current state, you’re calculating the cumulative reward based on the current state and the action that’s taken.

Source

Bellman Optimality

Finding the optimal action for both the state-value function and the action-value function is hard because of the enormous number of states and actions that you can be in. What we can do is use a Bellman optimality equation for each function:

Source

Instead of considering all the possible timesteps into the future, we look directly at the next timestep and recursively iterate through that over time. How do we find Q*, though? This is where other methods, such as Q-learning, come into play:

Source
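For intuition, here’s what the tabular Q-learning update looks like in code. The 5-state, 2-action environment is a made-up placeholder; the point is the update rule itself:

```python
import numpy as np

# Toy tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(Q, s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()     # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) towards the target

# One illustrative transition: from state 0, action 1 gave reward 1.0, landed in state 2.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```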

Deep Q-Learning

Q-Learning can’t scale to high-dimensional spaces because of the number of possible states that exist. Instead of creating a computationally infeasible Q-table, we can use a function approximator (a neural network) to estimate Q:

Source

Our loss function is just a straightforward MSE on our Q-values, and our gradient update is taken with respect to the Q-function parameters:

Source

The problem is that this doesn’t converge in practice. To solve this problem, we need two things: experience replay and delayed (target) Q-network updates.

We can create a replay memory that stores our “experiences” (state, action, reward, and next state) and then train on samples drawn from it. This lets us reuse past experience and train on decorrelated samples rather than only on the most recent, highly correlated transitions.

We also use a separate, periodically updated set of weights (a target network) in order to reduce oscillations in our targets:

Source

This is what the final procedure would look like:

  1. Take an action based on our ϵ-greedy policy
  2. Store this information (state, action, reward, next state) into our replay memory
  3. Sample a couple of transitions from our replay memory
  4. Calculate the Q-targets using our older parameters (θ⁻)
  5. Optimize the loss function (MSE) via stochastic gradient descent (SGD).
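Here’s a compressed sketch of one training step following the five steps above. The network sizes, hyperparameters, and the dummy transitions used to fill the replay buffer are all illustrative; a real agent would gather transitions by acting in an environment:

```python
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())        # theta^- starts as a copy of theta
optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)

# Fill the replay memory with dummy (s, a, r, s') transitions for illustration.
for _ in range(256):
    replay.append((torch.randn(state_dim), random.randrange(n_actions),
                   random.random(), torch.randn(state_dim)))

def train_step(batch_size=32):
    batch = random.sample(replay, batch_size)         # step 3: sample transitions
    s      = torch.stack([b[0] for b in batch])
    a      = torch.tensor([b[1] for b in batch])
    r      = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])
    with torch.no_grad():                             # step 4: targets from theta^-
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)       # step 5: MSE, then a gradient step
    optim.zero_grad()
    loss.backward()
    optim.step()

for step in range(100):
    train_step()
    if step % 50 == 0:                                # periodically refresh theta^-
        target_net.load_state_dict(q_net.state_dict())
```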

Deep Deterministic Policy Gradients (DDPG)

Deep Q-Learning suffers from the fact that the set of possible actions is discrete, and from the uniform sampling of our replay buffer. DDPG instead focuses on creating a policy that works in continuous action spaces. What we can do is create two networks: a deterministic policy (actor) that maps a state → action, and a critic that models the Q-function.

Source
Source

Experience replay and target networks are also used to make sure that oscillations are minimized during training.
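A bare-bones sketch of the two DDPG networks, just to show the shape of the idea; dimensions are made up, and the replay buffer and target networks discussed above would wrap around this in a full implementation:

```python
import torch
import torch.nn as nn

# Deterministic actor mu(s) producing a bounded continuous action (e.g. steering
# in [-1, 1]) and a critic Q(s, a) scoring state-action pairs.

state_dim, action_dim = 8, 2

actor = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim), nn.Tanh(),   # squash to a bounded continuous action
)
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),                       # scalar Q(s, a)
)

s = torch.randn(32, state_dim)
a = actor(s)                                 # actor proposes actions
q = critic(torch.cat([s, a], dim=1))         # critic scores them
actor_loss = -q.mean()                       # the actor ascends the critic's estimate
print(a.shape, q.shape, actor_loss.item())
```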

Visual Odometry

Visual Odometry is all about estimating the motion of our autonomous vehicle from images. There are two approaches to tracking the pose (position + orientation) of our autonomous vehicle: indirect and direct methods.

Source

Indirect visual odometry

The first step with indirect visual odometry is to find key points (salient points) and extract features from them. The goal is to ensure that our feature extraction algorithm is invariant to changes in lighting and perspective to maximize performance. Using algorithms like ORB, FAST, and SIFT, we can then perform feature matching between two images to find similarities.
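Here’s what that front end can look like with OpenCV’s ORB detector and a brute-force matcher. The image file names are placeholders for two real consecutive camera frames:

```python
import cv2

# Detect keypoints, compute binary descriptors, and match them between two
# consecutive frames. "frame1.png"/"frame2.png" are placeholder file names.

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)   # keypoints + descriptors, frame 1
kp2, des2 = orb.detectAndCompute(img2, None)   # keypoints + descriptors, frame 2

# Brute-force Hamming matcher with cross-checking to reject one-way matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

print(f"{len(matches)} matches; best distance = {matches[0].distance}")
```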

2D → 3D

We need to be able to go between 2D image points and 3D world points in order to perform accurate projections. The first tool for this is homogeneous coordinates: we convert an inhomogeneous vector x = [x y]ᵀ into a homogeneous vector x̃ = [x̃ ỹ w̃]ᵀ. By augmenting a 2D point with an extra coordinate, we’re able to express all of our operations (rotations, shifts, scaling, translations) as matrix multiplications, which simplifies the overall process:

Source
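To make the “everything becomes a matrix multiplication” point concrete, here’s what a simple 2D translation looks like in homogeneous coordinates (my own illustrative example): translation isn’t a linear map on [x y]ᵀ, but it becomes one once the extra coordinate is appended.

```latex
\tilde{x}' =
\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
=
\begin{pmatrix} x + t_x \\ y + t_y \\ 1 \end{pmatrix}
```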

The way we convert 3D points into 2D image points is through the calibration matrix K:

Source
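In the standard pinhole formulation (written here in common textbook notation, which may differ slightly from the slide), a homogeneous world point X̃ is projected into the image via the intrinsics K and the camera pose [R | t]:

```latex
\tilde{x} \;\simeq\; K \,[\, R \mid t \,]\, \tilde{X},
\qquad
K = \begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
```

where f_x, f_y are the focal lengths in pixels, (c_x, c_y) is the principal point, and s is the (usually zero) skew.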

Epipolar Geometry

The goal behind epipolar geometry is to understand the motion in a scene between 2 images:

Source

The idea behind epipolar geometry is that there’s a matrix that captures the projective geometry between the two images (i.e. where x₁ would project in the second image). If x₁ and x₂ are corresponding points, then x₁ᵗFx₂ = 0, which means x₁ and x₂ lie on the same epipolar plane (note that ᵗ denotes transpose in this instance). We can use non-linear optimization to optimize our R (rotation matrix, i.e. the orientation of the camera in the world coordinate system) and t (the position of the camera’s center in the world coordinate system):

Source

What we’re doing here is essentially taking the magnitude of the difference between the predicted points x₁ and x₂ in order to minimize reprojection errors. Therefore, we need only 4 images (two from the left and two from the right).

Note: one of the biggest problems with monocular visual odometry is that we aren’t able to determine global scale (convert the relative distances of the reconstruction into absolute, real-world measurements) because depth information cannot be extracted from a single camera.
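Here’s a sketch of recovering the relative motion (R, t) from matched points via the essential matrix using OpenCV. The “matches” here are synthesised by projecting random 3D points through two known camera poses; in practice pts1/pts2 would come from feature matching like the ORB step above, and, as the note says, with a single camera t is only recovered up to scale:

```python
import numpy as np
import cv2

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
X = np.random.uniform([-2, -2, 4], [2, 2, 8], size=(60, 3))   # 3D points in front of camera 1

def project(X, R, t):
    """Project 3D points into the image with intrinsics K and pose (R, t)."""
    x = (K @ (R @ X.T + t.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

R_true, _ = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]).reshape(3, 1))  # small yaw between frames
t_true = np.array([0.1, 0.0, 0.0])                                    # small sideways translation
pts1 = project(X, np.eye(3), np.zeros(3))
pts2 = project(X, R_true, t_true)

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
print("recovered R:\n", R.round(3), "\nrecovered t (unit scale):", t.ravel().round(3))
```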

Direct visual odometry

In direct visual odometry, we completely skip the step of keypoint detection and instead use depth from sensors (RGB-D, LiDAR) to create a 3D map:

Source

If we know the per-pixel depth given the image intensities I₁ and depth map D₁, the image can be simulated from multiple viewpoints.

Simultaneous Localization and Mapping (SLAM)

Note: for this section, I assume you have a basic understanding of what localization is. If you don’t, check out this article that I wrote explaining how localization + particle filters work!

The idea behind feature-based SLAM is to optimize reprojection errors. We do this through an approach known as Bundle Adjustment:

Source

Note that N denotes the number of camera views and P the number of landmarks that we’re observing. We minimize the reprojection error jointly over all camera poses and the 3D landmark positions X_w.
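Written out, the bundle adjustment objective is the sum of squared reprojection errors over every camera i and every landmark j it observes (the notation here is mine and may differ slightly from the slide):

```latex
E\big(\{R_i, t_i\}_{i=1}^{N}, \{X_j\}_{j=1}^{P}\big)
\;=\;
\sum_{i=1}^{N} \sum_{j=1}^{P}
\big\lVert\, x_{ij} - \pi\!\big(K (R_i X_j + t_i)\big) \,\big\rVert^{2}
```

where x_ij is the observed pixel location of landmark j in view i and π is the perspective projection.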

We can also perform loop closure detection, where the camera recognizes, based on the landmarks in its environment, that it has returned to a previously visited location.

Image Reconstruction

Given two stereo images, we can calculate the shift of each pixel between them (the disparity) to create a disparity map. We can then use this disparity map (which is directly related to depth) to reconstruct the 3D structure of our environment:

Source (note that the 160 means the amount of pixels shifted)

For binocular stereo matching, our goal is to create a 2.5D disparity map from 2 images of our scene (static). The overall process would look like the following:

  1. Calibrate both of our cameras and apply this to our images
  2. Create the disparity map
  3. Remove any sort of outliers in our disparity map
  4. Using triangulation, calculate depth
  5. Construct a 3D model

Since our cameras are already calibrated, we already know our rotation and translation matrices. Therefore, what we’re solving for are the corresponding points between our left and right cameras. The benefit of knowing R and t is that we also know the essential matrix, which means we can leverage the epipolar constraint to make the correspondence search a 1D problem:

Source

When we look at the point x₁ in the left image, it maps to a line in the right image (denoted l₂). Therefore, finding a correspondence for x₁ means we only have to search along the epipolar line l₂. To make this search even simpler, we use a process known as Image Rectification.

The idea behind image rectification is to rewarp our cameras to be perfectly parallel to each other given our calibration matrix K and the essential matrix E:

Source

We can then use a sliding window implementation (block matching) to go through the entire line and select the window with the highest correspondence. We can normalize the feature vectors to account for lighting/illumination changes and then calculate the zero-normalized cross-correlation (ZNCC):

Source
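A tiny ZNCC block-matching sketch for a single pixel on one scanline. It assumes already-rectified grayscale images; the window size, disparity range, and the synthetic image pair are all illustrative:

```python
import numpy as np

def zncc(a, b, eps=1e-6):
    """Zero-normalized cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def match_pixel(left, right, row, col, half=3, max_disp=64):
    """Find the disparity of (row, col) by sliding a window along the epipolar line."""
    ref = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_score = 0, -np.inf
    for d in range(0, min(max_disp, col - half)):
        cand = right[row - half:row + half + 1, col - d - half:col - d + half + 1]
        score = zncc(ref, cand)
        if score > best_score:
            best_d, best_score = d, score
    return best_d

# Synthetic rectified pair: the right image is the left image shifted by 5 px.
left = np.random.rand(100, 200)
right = np.roll(left, -5, axis=1)
print(match_pixel(left, right, row=50, col=100))   # -> 5
```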

To deal with outliers, we use a left-right consistency test: we calculate a disparity map for both the left and right cameras and verify that the disparity values agree with each other. Pixels whose disparities don’t match are simply discarded:

Source

You can then use basic geometry to determine depth from disparity:

Source
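For a rectified stereo pair, the standard relation (with f the focal length in pixels and B the baseline between the two cameras) is:

```latex
Z \;=\; \frac{f \, B}{d}
```

so depth Z is inversely proportional to disparity d: nearby objects shift a lot between the two images, while distant objects barely move.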

Optical Flow

Optical flow is about understanding the motion of objects in your scene over time. The reason we’d want to study optical flow is that understanding the motion of objects (other cars) over time lets us predict their future locations + trajectories. Let’s use the Horn-Schunck optical flow model to better understand how flow works:

Source

Note that u(x,y) and v(x,y) represent the flow in the x and y direction, respectively. The main idea behind the energy functional is that we displace the current pixel (x,y) by its optical flow (u and v) in the next timestep (think of it as predicting the motion at timestep t+1) and then subtract that from the intensity at the current timestep. If our optical flow is estimated correctly, we would get an extremely small value (since the intensity, i.e. pixel brightness, stays the same).

We also want the flow field to remain smooth, without any sudden jumps across the image. By taking the magnitude of the flow gradient, we penalize large variations in the flow; this term is weighted by the regularization parameter λ.

The main problem with the energy functional is that minimizing it is extremely complicated as a result of the non-convex nature of the function (which would mean that we have several local optima). What we can do is instead linearize the brightness constancy assumption (intensity remains the same over time).

We can do this by taking a first-order Taylor series of the intensity function, as shown below:

Source

Plugging that back in gives us this linearized energy functional:

Source
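In common notation (which may differ slightly from the slide), with I_x, I_y, I_t the image derivatives, the linearized energy looks like this:

```latex
E(u, v) \;=\; \iint \big( I_x u + I_y v + I_t \big)^{2}
\;+\; \lambda \big( \lVert \nabla u \rVert^{2} + \lVert \nabla v \rVert^{2} \big) \, dx\, dy
```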

We can now discretize the flow maps (u and v → U and V), which essentially means we have numerical values at each (x, y) index rather than a continuous function. Differentiating with respect to U and V and setting the gradient to zero yields a linear system that can be solved:

Source

Deep learning has also made a huge impact on optical flow. U-Net-like (encoder-decoder) architectures have enabled accurate flow prediction in several applications:

Source

Vehicle Dynamics

To control a car, you need to understand how a car moves. That’s exactly what vehicle dynamics is all about.

Source

Since cars can’t move sideways, the vehicle’s velocity is subject to nonholonomic constraints, while the car is also grounded by 3 holonomic constraints.

Coordinate Systems

Coordinate systems define vehicle motion. There are generally three main coordinate systems used: the inertial frame, the vehicle frame, and the horizontal frame.

Source

Kinematics of Rigid Bodies

A rigid body is essentially some sort of mass (like a car). The motion of a rigid body is mainly described through a reference point (which could be its center of mass, for example) and the relative motion of all the other points in the rigid body:

Source

There’s always a linear + rotational component when we describe the motion of a rigid body (which is where ω comes from).

Kinematic Bicycle Model

This is one of the simplest motion models: we assume that there is no slipping and that the wheels always point in the direction of the velocity vector. The kinematic bicycle model also assumes the car can be collapsed onto 2 wheels (like a bicycle) rather than 4.

Source

We can then use the vehicle velocity vector as our way to determine our x, y, ψ, and β:

Source
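Here’s a minimal integrator for the (common centre-of-mass) kinematic bicycle model. The wheelbase split, speed, steering input, and timestep are illustrative values I picked for the sketch:

```python
import numpy as np

l_f, l_r = 1.2, 1.6             # distances from centre of mass to front/rear axle (m)

def step(state, v, delta, dt=0.05):
    """Advance [x, y, psi] one timestep given speed v and steering angle delta."""
    x, y, psi = state
    beta = np.arctan(l_r / (l_f + l_r) * np.tan(delta))   # slip angle beta at the CoM
    x   += v * np.cos(psi + beta) * dt
    y   += v * np.sin(psi + beta) * dt
    psi += v / l_r * np.sin(beta) * dt                    # yaw rate
    return np.array([x, y, psi])

state = np.zeros(3)              # start at the origin, heading along +x
for _ in range(100):             # 5 s of driving at 10 m/s with constant steering
    state = step(state, v=10.0, delta=np.deg2rad(5))
print(state.round(2))            # final [x, y, psi]
```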

One limitation of the bicycle model is that on a real car, if the steering angle of both front wheels were the same, the tyres would slip (which is not what we want, and also why those two angles are actually different). We can use Ackermann Steering Geometry to account for this, and the bicycle model approximates it well at small steering angles:

Source

Dynamic Bicycle Model

Since the mass of a rigid body isn’t concentrated at a single point (the center of mass), you need to integrate a density function ρ over the space of the object to get its mass:

Source

Your inertia tensor changes if you assume that the origin is in the rigid body:

Source

Putting this all together:

Source

Path Planning

Once you understand the scene + all the objects around your vehicle, you need to create a trajectory you can execute. Path planning is all about decision-making: creating a plan that gets you from your current location → destination. We can break this down into a series of simpler steps:

Source

We can frame route planning as a graph-based optimization problem where we’re trying to go from point A → B over the shortest distance possible.

Breadth-First Search

Road networks can be represented as directed graphs with vertices and edges, with each edge corresponding to a road/lane and each node representing a landmark/location. Breadth-first search is all about finding the shortest route possible (in terms of the number of edges) to minimize distance when driving.

Source

The overarching idea behind breadth-first search is that we can use a queue to store the frontier, a set of all nodes we’ve processed, and a dictionary to backtrack once we find the goal. The problem with breadth-first search is that we aren’t able to represent road lengths, traffic, and other exogenous variables. We instead need a weighted graph that accounts for these.
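A minimal BFS route search on a toy road graph; the intersections, connections, and names below are made up purely for illustration:

```python
from collections import deque

# Nodes are intersections, edges are (unweighted) road segments.
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs_route(graph, start, goal):
    queue = deque([start])
    came_from = {start: None}           # backtracking dictionary
    while queue:
        node = queue.popleft()
        if node == goal:                # reconstruct the path by walking backwards
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in came_from:    # doubles as the "visited" set
                came_from[nxt] = node
                queue.append(nxt)
    return None

print(bfs_route(roads, "A", "E"))       # -> ['A', 'B', 'D', 'E'] (fewest edges)
```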

Dijkstra’s Shortest Path Algorithm

The main difference between Dijkstra’s algorithm and breadth-first search is that we store information about the cost of going down a certain path (i.e. edge weights). We use a min-heap (priority queue) and always expand the cheapest node discovered so far, updating the cost of reaching each of its adjacent nodes before repeating the process:

Source

Dijkstra’s algorithm is guaranteed to find the shortest path. Here’s a visual of what that would look like:

Source

The only problem with Dijkstra’s algorithm is that it doesn’t scale computationally to large graphs (whole cities). What we can do instead is use planning heuristics via Euclidean distance estimates:

Source

A* Algorithm

Exploiting planning heuristics is exactly what A* is all about. We combine Dijkstra’s cost-so-far with the Euclidean distance to the goal (adding them together) to determine the optimal trajectory that we want to follow:

Source
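A small A* sketch on a made-up weighted road graph: edge weights stand in for road length/travel time, and the straight-line (Euclidean) distance to the goal is the heuristic. Setting the heuristic to zero recovers plain Dijkstra:

```python
import heapq
import math

coords = {"A": (0, 0), "B": (1, 1), "C": (1, -1), "D": (2, 0), "E": (3, 0)}
edges = {
    "A": [("B", 1.6), ("C", 1.5)],
    "B": [("D", 1.5)],
    "C": [("D", 1.4)],
    "D": [("E", 1.0)],
    "E": [],
}

def heuristic(n, goal):
    """Straight-line distance to the goal (an optimistic estimate of remaining cost)."""
    (x1, y1), (x2, y2) = coords[n], coords[goal]
    return math.hypot(x2 - x1, y2 - y1)

def a_star(start, goal):
    frontier = [(heuristic(start, goal), 0.0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)               # expand lowest f = g + h
        if node == goal:
            return path, g
        for nxt, w in edges[node]:
            g_new = g + w
            if g_new < best_g.get(nxt, math.inf):
                best_g[nxt] = g_new
                heapq.heappush(frontier, (g_new + heuristic(nxt, goal), g_new, nxt, path + [nxt]))
    return None, math.inf

print(a_star("A", "E"))   # -> (['A', 'C', 'D', 'E'], 3.9)
```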

Navigation planners like Google Maps exploit the A* algorithm to find the optimal path to take when travelling. Now that we’ve created a higher-level route for how our autonomous vehicle can get to its destination, we need to follow it:

Behaviour Planning

We want to discretize the possible behaviours that our autonomous vehicle would need to follow when driving (ex. stopping at a red light, overtaking a vehicle, slowing down at a stop sign, etc.). Behaviour Planning is all about understanding how traffic rules and interactions with the ego-vehicle’s environment determine the actions that it must take.

We can do this with Finite State Machines (FSMs) that account for the possible situations in driving. Think of it as a series of if-statements that determine the best course of action based on whether certain thresholds are met:

Source
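A toy FSM sketch of that idea. The states, thresholds, and observation fields are made-up placeholders; real behaviour planners have many more states and far richer transition conditions:

```python
from enum import Enum, auto

class Behaviour(Enum):
    LANE_KEEP = auto()
    FOLLOW_VEHICLE = auto()
    STOP = auto()

def next_behaviour(current, obs):
    """obs is a dict of perception outputs, e.g. {'red_light': bool, 'lead_gap_m': float}."""
    if obs["red_light"] or obs["lead_gap_m"] < 5.0:
        return Behaviour.STOP                 # hard constraint: stop
    if obs["lead_gap_m"] < 30.0:
        return Behaviour.FOLLOW_VEHICLE       # keep a safe gap behind the lead car
    return Behaviour.LANE_KEEP                # default: stay in lane

state = Behaviour.LANE_KEEP
for obs in [{"red_light": False, "lead_gap_m": 80.0},
            {"red_light": False, "lead_gap_m": 20.0},
            {"red_light": True,  "lead_gap_m": 20.0}]:
    state = next_behaviour(state, obs)
    print(state)
```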

Motion Planning

Now that the FSM has told us which behaviour needs to be executed, we can calculate a safe trajectory to execute it. There are two general approaches to the motion planning problem:

  • Variational methods
  • Graph-search methods

The idea behind variational methods is to minimize a functional J(π) over our trajectory π, where π describes the path of our vehicle over time. Variational methods usually lead to a non-linear problem, which you’ll have to solve via numerical optimization. Here’s a writeup I did on the Jerk Minimization Trajectory algorithm if you’re interested in what an application of variational methods looks like.

Source

Graph-search methods, just like in route planning, discretize the configuration space into a set of nodes + edges, to which you can then apply Dijkstra or A* to determine the optimal path to follow.

Source

Vehicle Control

Control is the way you’re able to bridge the gap between software and hardware. Even if you create a trajectory for a vehicle to follow, you need to be able to break it down into intermediate steps (brake, throttle, steer) and directly interface it with the car. There are 2 approaches to the control problem: open-loop and closed-loop control.

Open-loop control

Source

At every point in time, there’s a certain speed we want to travel at (the reference r(t)). The controller takes that reference as input and outputs the controlled variable y(t), which should ideally approach r(t) over time despite environmental noise and other disturbances (z(t)).

Open-loop control isn’t scalable mainly because of the fact that predicting all the noise (disturbances) is impossible in the real world. Any sort of unknown disturbance that occurs in the environment will cause the car to go into chaos.

Closed-loop control

Source

The difference is that we now have sensor measurements of the process that feed back into the controller (so we can compare y(t) with r(t)). Subtracting the two gives us the error, which is exactly what the controller is trying to drive to zero.

PID controllers

There are 3 main values that come out of a PID controller:

  • proportional: a correcting value that’s proportional to the error, multiplied by the proportional gain Kₚ
  • integral: which integrates the total error over all time steps, multiplied by the integral gain Kᵢ
  • differential: calculates the derivative of the error with respect to time, multiplied by the differential gain K_d

u(t) is the sum of all these 3 elements together.

Source

The proportional element is the most important, since it responds in direct proportion to the error. The problem with using P alone is that it can lead to overshooting: we cross e(t) = 0, e(t) becomes negative, and we then overshoot in the opposite direction, oscillating repeatedly.

This is where the idea of adding a differential element comes in. By calculating the rate of error growth with respect to time, we’re able to introduce a damping behaviour, allowing for a smoother acceleration/deceleration.

The integral element corrects for residual (steady-state) errors by accumulating the total error over time; ideally the accumulated error stops changing, indicating that the instantaneous error has been driven to zero. This allows for smoother long-term driving, keeping the error as low as possible.
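Here’s a minimal PID sketch, e.g. for tracking a target speed. The gains and the toy longitudinal “plant” (throttle minus drag) are illustrative values, not tuned numbers:

```python
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def control(self, error, dt):
        self.integral += error * dt                     # I: accumulate the error over time
        derivative = 0.0 if self.prev_error is None \
            else (error - self.prev_error) / dt         # D: rate of change of the error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.8, ki=0.1, kd=0.2)
speed, target, dt = 0.0, 20.0, 0.1
for _ in range(100):
    u = pid.control(target - speed, dt)     # e(t) = r(t) - y(t)
    speed += (u - 0.05 * speed) * dt        # toy plant: throttle minus drag
print(round(speed, 2))                      # should settle near the 20 m/s target
```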

How do you select the optimal parameters Kₚ, Kᵢ, K_d? The most common method of doing this is the Ziegler-Nichols method:

Source

Model Predictive Control (MPC)

When traveling at high speeds, it’s important to understand the long term so that we can prepare to adapt to our environment (ex. don’t maintain high speeds for too long). This is where Model Predictive Control comes into play.

MPC is a non-linear optimization system that allows us to look into the future and determine how we best want the autonomous vehicle to be controlled:

Source

The goal is to optimize the cost function over the control parameters δ. Here’s an article I’d recommend for an in-depth explanation of how MPC works.
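To give a flavour of the receding-horizon idea, here’s a toy MPC sketch: optimize a short sequence of steering angles so a simple kinematic model stays on the lane centre (y = 0) while penalising steering effort. The horizon, weights, model, and use of scipy’s general-purpose optimizer are all my own simplifying assumptions, not the lecture’s exact formulation:

```python
import numpy as np
from scipy.optimize import minimize

H, dt, v, L = 10, 0.1, 10.0, 2.8           # horizon steps, timestep, speed, wheelbase

def rollout(deltas, state):
    """Simulate the simple kinematic model over the horizon for a steering sequence."""
    x, y, psi = state
    traj = []
    for d in deltas:
        x += v * np.cos(psi) * dt
        y += v * np.sin(psi) * dt
        psi += v / L * np.tan(d) * dt
        traj.append((y, psi, d))
    return traj

def cost(deltas, state):
    # Penalise lateral offset, heading error, and steering effort over the horizon.
    return sum(10.0 * y**2 + 1.0 * psi**2 + 0.1 * d**2 for y, psi, d in rollout(deltas, state))

state0 = np.array([0.0, 1.0, 0.0])          # start 1 m off the lane centre
res = minimize(cost, x0=np.zeros(H), args=(state0,),
               bounds=[(-0.5, 0.5)] * H, method="SLSQP")

# Only the first command is executed; the whole optimisation is then re-run at
# the next timestep with the new state (the "receding horizon").
print("first steering command:", round(res.x[0], 3))
```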

Evaluating self-driving cars

It’s important to understand how the performance of your autonomous vehicle scales with data. There are two main ways to evaluate: offline and online evaluation.

The idea behind online evaluation is that the model interacts with its environment in real time, where its actions directly impact the state of the environment it’s in. In offline evaluation, you already have a fixed dataset that you’re directly comparing against (and you also know what the ground truths are).

Source

Further reading

Every single milestone made in the autonomous vehicle industry feels like we’re 10x closer towards owning one; in reality, we’re probably 100x further away from achieving that goal. Solving autonomous vehicles isn’t just about going from point A → B anymore. It’s about exploring the stochastic nature of our world as we continue to navigate and understanding how + why algorithms make the decisions that they do.

Companies to check out

Thanks for reading this article! My goal with this article is to help teach more people about how self-driving cars work from scratch and provide entry points into finding a research area that you might be passionate about. It would mean a lot if you shared this resource wherever you can! I spent a lot of time curating notes + creating a good understanding for people to learn from, and my goal is to help as many people as I can!

Contact me

