The state of comma.ai

Sri Anumakonda
15 min read · Sep 13, 2023

7 years ago!

“Give me $1,250 and I can make your car drive on its own.”

It’s almost crazy to think that in a world where we’ve spent more than $100 billion on the self-driving industry, we still haven’t truly solved the problem of self-driving cars. When will we get self-driving cars? How far have we even gotten since the DARPA days?

comma has stood out to me as a company because of how much progress they’ve made with some of the most limited resources in the industry. I’ve spent the past while understanding the fascination with end2end learning + solving self-driving with cameras, and why this will be the way we solve self-driving.

My intention with this blog is to dive into comma’s software through this year’s COMMA_CON talks (end2end learning) since I am a computer vision geek, and to figure out where comma really is today + how far they are from “solving” self-driving. I’m going to focus on the two main computer vision talks covering the primary ML stack (Harald’s and Yassine’s) with my commentary, and then share my own thoughts on how far comma is from solving the “self-driving problem.”

The openpilot tl;dr

openpilot is an advanced driver assistance system (ADAS), which is essentially a level 2 driving system (on the SAE scale of 0 to 5). It’s the software that runs on the comma 3 (and the 3x), which turns your car into a self-driving car.

But if you wanted to actually control the car, how would that work? Through the CAN bus (Controller Area Network). Each CAN message has two parts: an identifier and a data payload, which together provide everything you need to collect driving data (lateral + longitudinal) + be able to hack into your car and make it drive on its own!

an example of a message sent in the bus! source.
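To make the identifier + data split concrete, here’s a minimal sketch of unpacking a CAN frame in Python. The framing (2-byte big-endian ID, 1-byte length, then up to 8 data bytes) and the message ID are assumptions for illustration, not openpilot’s actual wire format:

```python
import struct

def parse_can_frame(raw: bytes):
    """Unpack a simplified CAN frame: an 11-bit identifier + up to 8 data bytes.

    Illustrative only: the framing (2-byte big-endian ID, 1-byte length,
    then data) is an assumption for this sketch, not openpilot's format.
    """
    can_id, length = struct.unpack_from(">HB", raw, 0)
    data = raw[3:3 + length]
    return can_id & 0x7FF, data  # mask down to the 11-bit standard identifier

# Hypothetical message: identifier 0x1A0 carrying 4 bytes of steering data
frame = struct.pack(">HB", 0x1A0, 4) + bytes([0x00, 0x7F, 0x12, 0x34])
can_id, payload = parse_can_frame(frame)
print(hex(can_id), payload.hex())  # → 0x1a0 007f1234
```

In a real car, tools like cabana (below) are what map a raw identifier + payload like this onto actual signals (steering angle, speed, etc.).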

The CAN-to-USB interface (a.k.a. the panda), which connects to the car via the OBD-II port, is what allows openpilot to send all the driving information into the car.


The raw car data is then processed with comma’s cabana tool in order to figure out which parts of the raw data are relevant to driving + which are not. From there, it’s simply a matter of figuring out which pieces of data need to be sent to the car + sending the control messages. panda ensures that the data being sent to the car is safe (and won’t cause any accidental sharp turns/acceleration).

the higher level openpilot architecture. source.

For those curious to learn more about the technical specifics behind porting a car, Jason’s talk at COMMA_CON does a great job explaining it.

The meat of driving: end2end learning

Great! Now you know about how comma works from an overarching level, but how does it learn to drive? The answer is through end2end learning:

old blog but still relevant! source.

The main idea behind end2end learning is that you can take in a raw image of the driving scene and achieve lateral + longitudinal control with the use of Convolutional Neural Networks. The question end2end learning essentially tries to answer is, “what if I didn’t create human intermediaries (semantic segmentation, detections, etc.) and allowed a model to learn driving on its own?”

Why would we want to do that? Because of the fact that creating intermediaries restricts the model’s ability to learn mappings of its own. If you want to create a self-driving car that’s robust to any scenario, you need to allow it to learn its own mappings + patterns (that humans might not have even thought of!).

You can get results so good that you don’t even need to tell your model to detect traffic lights (it just knows!):

One of the questions that popped into my mind was, “okay, great. we have a car that’s able to drive given an input image. How do we actually use that information to go from point A → point B?”

The model knows where to drive by leveraging the built-in navigation system, with maps rendered via Mapbox GL. These renders are then processed into 60-second videos to train the model.


To render the maps, you need the latitude + longitude for each map (done with the comma e2e model) and the route (the trickier part, because GPS is noisy, which they solve with Valhalla).

An autoencoder is then trained so that its bottleneck layer (the feature vector) can be used to compress the map data (since 256x256 is A LOT of pixels). Boom! You now have a car that can drive from Point A → Point B:

orange = predicted path, blue = ground truth. tested in simulation (which we’ll go over soon!). source.
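The map-compression idea above can be sketched with a toy linear autoencoder: squeeze the input through a small bottleneck, train for reconstruction, and use the bottleneck vector as the compressed map feature. The sizes, plain linear layers, and random data below are assumptions for brevity (comma’s real model works on 256x256 map renders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for flattened map tiles: 64-dim inputs squeezed through an
# 8-dim bottleneck. Sizes and plain linear layers are assumptions.
x = rng.normal(size=(256, 64))
W_enc = rng.normal(scale=0.1, size=(64, 8))
W_dec = rng.normal(scale=0.1, size=(8, 64))
lr = 0.1

def reconstruction_loss(x, W_enc, W_dec):
    z = x @ W_enc            # bottleneck feature vector (the "map embedding")
    x_hat = z @ W_dec        # reconstruction from the compressed code
    return ((x_hat - x) ** 2).mean(), z

initial, _ = reconstruction_loss(x, W_enc, W_dec)
for _ in range(500):
    z = x @ W_enc
    x_hat = z @ W_dec
    grad_out = 2 * (x_hat - x) / x.size        # dLoss/dx_hat
    g_dec = z.T @ grad_out                     # dLoss/dW_dec
    g_enc = x.T @ (grad_out @ W_dec.T)         # dLoss/dW_enc
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
final, z = reconstruction_loss(x, W_enc, W_dec)
print(f"loss {initial:.3f} -> {final:.3f}, code shape {z.shape}")
```

The point of the bottleneck is that downstream models only ever see `z`, the small compressed code, instead of the full map render.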

comma also focuses on driver monitoring + ensuring that the driver is paying attention at all times. This is done through the use of more end2end models. You can also use this information to predict uncertainty + figure out at what points in time the model is not confident in its control of the car (which can help drivers better understand when to disengage from openpilot and take over).

p.s. if you’re curious about what car customization looks like + ensuring that openpilot works on your car, look no further than vivek’s talk!

Driving to Taco Bell with end-to-end ML

What better way to start COMMA_CON than by listening to Harald and the end2end research that he’s doing?

The constraints that they’re using to deploy self-driving systems:

  1. fully end2end models. why? because it “gives more flexibility to the ML models to learn behaviors and patterns itself.” The key idea with end2end learning is that there is no explicit human control as to what underlying semantic information you want the car to see (ex. lane lines, pedestrians, semantic segmentation, etc.). Instead, end2end models implicitly allow the neural networks to learn all this spatial information without being “constrained” (perception layer).
  2. scaling with compute. this one, I believe, is one of the more important factors when it comes to ML deployment. it’s important that you can a) create a policy that performs well when it comes to driving and b) scale it across multiple environments (cities) with minimal difficulty
  3. good loss functions to ensure that you have a model that understands what to optimize for + ensure that loss function “reflects good driving.”
  4. diversity of data, which is something that you get from users and their driving logs. one thing that i would add here though is that there is a difference between gathering data vs. gathering HIGH-QUALITY data. If you want to create a “superhuman driving agent” then you need to either a) create/capture data that your car has never seen before but will most likely see in the real world or b) create a policy so good that it doesn’t need any prior information on how to deal with out-of-distribution (OOD) events.

He breaks comma’s end2end approach into 3 main pipelines:

  1. end2end lateral (steering/yaw rates) control
  2. end2end longitudinal (position/speed/acceleration) control
  3. end2end navigation (so that the car knows where it needs to go!)

how do they prove this? by showing a driving log of a drive to Taco Bell 🌮

One of the initial problems Harald brings up with the simulation that comma’s been using is that the cars + objects were warped sideways. how do they solve this? use a segnet to make depth assumptions + render basic objects!


Using sem-seg information, we can create a basic depth map and pair that information with the scene to build a better simulation. I wrote a blog a while back on their end2end lateral control model. 2 years later, the core ideas have stayed the same but there have been major updates (such as getting rid of the “student-teacher approaches” + the multiple architectures used) to ensure max efficiency and performance.


A fascinating idea that they rejected this year is the usage of stoplines to provide information to the e2e model as to when + where to stop:


Because of the sheer amount of variation in user data (ex. sudden braking/braking too early), it’s hard to create models that are 100% dependent on your data. I think, potentially, there is room for an algorithmic implementation of these types of stoplines (ex. gather scene information such as traffic lights + stop signs and create a model using those parameters + the image to produce the stopline), but I definitely agree that leveraging users’ data for this task was not going to give good results.

Idea #2 that they rejected was the use of Depthnets:


Depthnets are noisy + hard to predict + have lots of failure points (shadows, traffic lights, etc.). Performing these types of tasks without multiple camera systems is extremely difficult as you’re dependent on a single camera to solve the problem.

What they’re currently working on:

  1. switching from physics-based simulation to ML/end2end-based simulation (have a policy learn full simulation and get rid of any “human”/classical intervention)
  2. using Reinforcement Learning (RL) based control methods. comma’s 2021 blog mentions that they use a Model Predictive Control (MPC) solver to ensure smooth execution of driving. instead of following an algorithmic approach, use RL as a way to improve the model with respect to data + time.

Learning a Driving Simulator

One of the first simulators built back in 2016 was a “small offset simulator.” The idea here is essentially to “offset” the camera slightly to the left + right to see if the end2end policy is able to recover from the shifts.
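The small offset idea can be sketched in a few lines: translate the frame horizontally and check that the policy steers back against the shift. The edge fill and the toy corrective-steer check below are assumptions for illustration:

```python
import numpy as np

def small_offset(frame: np.ndarray, shift_px: int) -> np.ndarray:
    """Shift a camera frame left/right to fake a lateral offset.

    A minimal sketch of the "small offset simulator" idea: translate the
    image a few pixels and see whether the policy recovers. The zero fill
    at the edge and the shift amount are assumptions.
    """
    shifted = np.roll(frame, shift_px, axis=1)   # horizontal translation
    if shift_px > 0:
        shifted[:, :shift_px] = 0                # blank the wrapped columns
    elif shift_px < 0:
        shifted[:, shift_px:] = 0
    return shifted

def corrective_steer(shift_px: int, gain: float = 0.01) -> float:
    # Hypothetical check: a recovering policy should steer opposite the shift.
    return -gain * shift_px

frame = np.ones((8, 16))
right = small_offset(frame, 3)
print(right[:, :3].sum(), corrective_steer(3))
```

A real evaluation would feed the shifted frame through the trained policy instead of the toy `corrective_steer` stand-in.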

Yassine breaks down the ML Simulator architecture into 3 main components: the pose tokenizer, the image tokenizer, and the dynamics transformer.


Let’s dive into this together:

Image tokenizer


The idea is straightforward: take the image and compress it with a tokenizer into an encoded vector. Using that, we can then upsample with the decoder and validate the “realness” of the image via the discriminator.

The tokenizer essentially follows a VQ-GAN structure. It’s similar to a GAN (read more here). I won’t dive deep into the VQ-GAN architecture and its inner workings since this blog explains it perfectly.
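The core of the tokenizer, independent of the GAN training, is vector quantization: snap each encoder output to its nearest codebook entry and emit that entry’s index as the discrete token. A minimal sketch, with toy codebook and feature sizes as assumptions (the encoder, decoder, and discriminator are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy VQ step: 256 discrete codes of dimension 16, and encoder outputs for
# 64 image patches. Sizes are assumptions, not the real model's.
codebook = rng.normal(size=(256, 16))
features = rng.normal(size=(64, 16))

def quantize(features, codebook):
    # squared L2 distance from every feature vector to every codebook entry
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d.argmin(axis=1)         # discrete token index per patch
    return tokens, codebook[tokens]   # indices + quantized vectors

tokens, quantized = quantize(features, codebook)
print(tokens.shape, quantized.shape)  # (64,) (64, 16)
```

The `tokens` array is what the dynamics transformer later consumes; the `quantized` vectors go to the decoder for reconstruction.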

Their results?


As you can tell, the bit resolution is directly correlated with the quality of the VQ-GAN reconstruction (the lower the bits, the worse it gets).

Pose tokenizer

The main idea behind a pose tokenizer is to take in the pose data (x, y, z, yaw, pitch, roll) and be able to tokenize that data in such a way that you can create a set of tokens (continuous data → discrete).

The comma team does it in quite a straightforward manner: they digitize this information with uniform binning of the continuous data:


Dynamics Transformer

They then use a transformer to predict the next token (of the information we encoded above), which is then rolled out for the next n frames (known as autoregressive sampling). These are the results Yassine shows:

demo. the coolest thing? everything is open-sourced! source.
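Autoregressive sampling itself is a simple loop: predict a token, append it, and feed the tail of the sequence back in as context. Here’s a sketch where a toy scoring function stands in for the trained dynamics transformer (the vocabulary and context sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, CONTEXT = 32, 8

def next_token(context: np.ndarray) -> int:
    """Stand-in for the dynamics transformer: scores every token given the
    context. A real model would be a trained transformer; this hash-like
    scoring function is purely illustrative."""
    scores = np.sin(context.sum() + np.arange(VOCAB))
    return int(scores.argmax())

def rollout(prompt: np.ndarray, n_frames: int) -> np.ndarray:
    """Autoregressive sampling: each predicted token is appended and fed
    back in as input for the next step."""
    seq = list(prompt)
    for _ in range(n_frames):
        ctx = np.array(seq[-CONTEXT:])
        seq.append(next_token(ctx))
    return np.array(seq)

prompt = rng.integers(0, VOCAB, size=CONTEXT)
out = rollout(prompt, 10)
print(out.shape, out[-1])
```

Because every predicted token becomes input for the next step, errors can compound over long rollouts, which is why the recurrent-decoder idea discussed next is interesting.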

One of the initial thoughts I had when Yassine showed how the dynamics transformer works was that spatiotemporal information could be leveraged here to create a better, higher-quality model (and move toward video data vs. feeding outputs back in as inputs for a certain amount of time). What he brings up, though, is this interesting idea where you essentially place a recurrent layer in the decoder (with everything else frozen) so that the decoder can learn the disparities between multiple input images and improve the quality of the simulated prediction. Here’s the demo he shows in his talk (it’s a really interesting idea!).

With respect to moving, my assumption (and Yassine confirms this in the Q&A) is that they most likely can control what type of simulated output they want (ex. brake, acceleration, moving right/left, etc.) by controlling the pose tokens (in such a way that your values make the dynamics transformer predict the next frame as going in x direction).

The biggest room for improvement would be creating optimized pose tokens where you can figure out where your model lacks (ex. could be lane changing to the right) and gear simulation toward that. Then you can a) finetune your simulation and automate the pose tokenizer process but also b) create high-quality data that can dramatically improve model performance.

I think the simulation stack here is quite interesting. The majority of the problem will most likely be centered around ensuring that a) you have a proper dynamics transformer architecture where you can create videos (and leverage temporal information) while also ensuring that b) you have high quality simulation.

What does it take to solve self-driving?

George says that self-driving is already solved and that we are a couple “bugs” away from solving the overall driving problem. Is that true?

The way George defines the self-driving problem is: “how do you build a model that outputs a human policy for driving?”

He brings up a really good point on Lex Fridman’s podcast about how, fundamentally, you can create a basic policy that does the job (drives like a human). From there, it’s just a matter of fixing the “bugs” + scaling: scaling in compute + scaling in data. I’ll focus more on the latter since the former is quite straightforward to understand.

Udacity’s Self-Driving Car Dataset and the corresponding steering ground truths between [-1, 1] radians. source.

The majority of data that you get will likely be straightforward, basic driving (i.e. staying in your lane + following basic traffic rules). The fundamental definition of a generalizable self-driving agent is being able to navigate under all sorts of conditions and react to every scenario you face. When the majority of the data you collect isn’t diverse (ex. the steering values of the Udacity dataset above), your model will not be able to react accurately in those situations. There is no second chance when it comes to self-driving cars: a single mistake is life or death, and failure to prepare for every scenario is a recipe for disaster.
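One quick way to see this imbalance is to measure what fraction of steering labels sit near zero. The synthetic distribution below is an assumption standing in for real driving logs, but its shape (a huge near-straight peak plus a thin tail of maneuvers) roughly mirrors the kind of histogram the Udacity dataset shows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic steering labels in [-1, 1] radians standing in for real logs:
# mostly lane keeping tightly around zero, plus rarer turns and maneuvers.
steering = np.concatenate([
    rng.normal(0.0, 0.02, size=9000),   # lane keeping
    rng.uniform(-1.0, 1.0, size=1000),  # turns, lane changes, etc.
])
near_straight = np.abs(steering) < 0.05
print(f"{near_straight.mean():.1%} of frames are near-straight")
```

If ~90% of your frames are near-straight driving, a naive training run spends almost all of its gradient signal on the easy case, which is exactly why data diversity (or targeted simulation) matters.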

My two cents when it comes to creating generalizable self-driving is that you need to have the right combination of the following things in order to create a framework to “solve self-driving”:

  • compute (which George talks about)
  • a strong enough self-driving policy (which is bounded by the quality of data + model architecture. my contrarian take is that we already have good enough models but not good enough data)
  • going from simulation (which is what comma is building) to high-quality simulation where you can improve your model + sample high-quality data that can exponentially improve performance
  • interpretability of these models (from both a regulatory perspective + potential avenues in debugging + training these models)

Let’s dive a little deeper into the latter 2 buckets:

What ideal simulation would look like

Simulation is what enables self-driving to scale. It not only lets you scale faster but also, used correctly, improves model performance. Some of the characteristics that a high-quality, robust simulation should have include the following:

  1. Being able to create simulations of all sorts in video to allow for training + deployment across multiple environments (snow, rain, fog, etc.)
  2. Ability to create challenging scenarios that the existing model might not have exposure to in certain geographies (ex. snow in Texas). The second part to this would be to use this data and improve the existing model.
  3. Building on from the existing point, be able to simulate out-of-distribution/extreme edge case scenarios where there is high likelihood of failure and ensure that your model can deal with these types of situations.

I think Wayve’s research with GAIA-1 is a really good example of what a powerful simulation with pure computer vision looks like. Leveraging this in such a way that allows you to be able to improve + validate existing models can result in really big breakthroughs in the end2end learning space.

Interpretable end2end learning

I expand a little more on this in my talk @the Austin Computer Vision Meetup, but the rundown is that understanding how + why your model makes the decisions it does can play a really big role in:

  1. Debugging failure cases/crashes and understanding why the model made the decisions that it did
  2. Leveraging interpretability in simulation + improving training
  3. Potentially finding outliers in data that have a big impact on model performance

From a regulatory standpoint, it’s important [to some degree] that you have some ability to peek into this black box and understand why your model made the decisions it did. But beyond regulation, there is also a good amount of information that can be learned from that ability.

Understanding potential failure cases of your model allows you to iterate and leverage simulation to improve these models. For example, if my comma 3 struggled in snow, then understanding what the model sees via saliency maps (as an example) allows us to understand whether there are characteristics of snow that it struggles with. Using that information, we can then re-simulate these environments to improve the models.


So, have we solved self-driving?

The short answer is yes and no.

The way we get to solve self-driving cars is clear (and has been clear for the past couple of years), but it will take time until the problem truly gets solved. One of the questions I like to ask myself is, “what does it even mean to solve self-driving?” The world, by nature, is stochastic, which means that there will always be situations encountered that your model has not been exposed to in its dataset.

How do you create a model that can adapt + deal with any situation that it’s put in? Will simulation allow you to skip the “real-world testing” and be able to deploy these models straight from simulation? How powerful of a simulator do we need in order to solve self-driving cars? To what degree do we need to worry about understanding these models? Does it even matter if we can understand these models as long as the model can drive with good performance? How good of a model is needed? Are we trying to replicate human driving (and, to be fair, humans are excellent drivers), or are we trying to create a self-driving car where we want superhuman performance? How do you even regulate self-driving cars?

These are some of the questions that we should be thinking about when it comes to creating self-driving cars. For me, comma is one of the most inspiring companies right now in the self-driving space because of how fascinating their research is + how much progress they’ve made in the space with relatively limited resources [compared to most companies]. I really do believe that the “comma approach” has the potential to really disrupt the self-driving/robotics industry as a whole and that this will be the way we solve self-driving cars.

Thanks for reading this article! My goal with this article is to teach more people about how cool self-driving is through one of the self-driving companies that I truly believe will solve this problem + push the frontiers of humanity. It would mean a lot if you shared this resource wherever you can! I spent a lot of time curating notes + building a good understanding for people to learn from, and I want to help as many people as I can!

To learn more about me and the research that I’m doing in self-driving cars, here’s my twitter + linkedin + website. My research primarily focuses on leveraging computer vision to train better autonomous vehicle policies while ensuring that we create self-driving systems that are able to scale. Right now, I’m doing research at UIUC, looking into using generative modelling techniques to create better simulation + training for self-driving cars.