Can't we use ChatGPT, or any of its cousins, for solving autonomous driving?
ChatGPT has celebrated its first anniversary, and the AI landscape is changing at an amazing speed. It is probably fair to say that the bleeding-edge machine learning (ML) models (such as the widely popular large language models like GPT-4, a further development of ChatGPT, as well as more niche computer vision models like SAM and DINOv2) have not yet reached their full potential in the autonomous driving (AD) field. In this article, our experts will reflect on the gap we think still exists between (a) what those ML models are capable of, and (b) what is actually needed to solve AD.
Put differently: “Can’t we just put GPT-4 into a car and call our self-driving vehicle done? Looking at this, it seems to understand traffic well!”
Prompt to GPT-4:
Can you provide a detailed description of the image that could be used by an autonomous vehicle to safely navigate that road scenario? Please, focus on the safety relevant aspects.
Response:
The path shown by the recent ML advances is indeed truly impressive, and it may very well lead the way to a full AD solution eventually. However, we are not there yet; there are some quite important technical aspects that are just… well, different for the AD use case compared to how ChatGPT works. While none of them seems fundamentally impossible to solve, there is at the very least quite some work left to bridge the gap. Let’s look at some of the aspects of this gap that we believe are most crucial:
ChatGPT has been trained on an insane amount of text data and is fluent in text. The latest version, GPT-4, can also interact with images. And there are several recent niche models for computer vision that work with different aspects of understanding image content.
For AD, images are of course important. And we have a lot of them, both because there are many cameras (typically looking in different, partially overlapping, directions) and because each camera takes many pictures every second. Fully understanding the surroundings of the car requires more than just the latest image. To understand whether another vehicle is moving or not, the bare minimum is to consider two subsequent images and tell whether it has shifted its position. In practice, far more than two images are needed to get a reliable understanding of the dynamics, not least because pedestrians can temporarily be hidden by passing vehicles. As a small example, let us assume we use a 5-second sliding window (that is, we only remember the images from the last 5 seconds) and take 10 images per second with 4 different cameras. That gives 200 images to process. And those 200 images must be considered jointly, with respect to both their exact timestamps and where the cameras are mounted on the car relative to each other. To the best of our knowledge, the sheer amount of connected images that needs to be interpreted together is still a showstopper for those models.
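To make the bookkeeping concrete, here is a minimal sketch (our own illustration, not any particular AD stack; the CameraFrame and FrameBuffer names are made up) of the sliding-window buffer described above, assuming 4 cameras at 10 images per second and a 5-second window:

```python
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class CameraFrame:
    camera_id: str    # which of the (here assumed) 4 cameras took the image
    timestamp: float  # capture time in seconds
    image: Any        # placeholder for the actual pixel data


class FrameBuffer:
    """Keep every frame from the last `window_s` seconds, across all cameras."""

    def __init__(self, window_s: float = 5.0):
        self.window_s = window_s
        self.frames = deque()  # ordered by arrival time

    def add(self, frame: CameraFrame) -> None:
        self.frames.append(frame)
        # Drop frames that have fallen out of the sliding window.
        while self.frames and frame.timestamp - self.frames[0].timestamp > self.window_s:
            self.frames.popleft()


# 4 cameras x 10 frames/s x 5 s = 200 frames that must be interpreted jointly,
# each tied to its exact timestamp and camera mounting position.
n_cameras, fps, window_s = 4, 10, 5.0
print(int(n_cameras * fps * window_s))  # 200
```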
It is widely believed that cameras alone will not be sufficient for safe AD. They will likely have to be complemented with lidar, and possibly radar, since the 3D point clouds obtained from these technologies have an unmatched ability to capture distances to other objects, compared to plain 2D camera images.
A lidar point cloud visualized in the Kognic app. The point cloud is a 3D mapping of the world around the car.
Although research steps are being taken to apply advanced ML models to lidar point clouds (to mention just a few: LidarCLIP, PointCLIP, as well as a recent master thesis at Kognic), there is still much work left to do until we have an ML model that can explain a point cloud with the same accuracy as we have seen for images.
The true power of the lidar, however, probably does not come from using and interpreting it on its own. Instead, it is the sensor fusion between lidars and cameras that seems to become really powerful - the granularity of the camera pixels together with the depth information from the lidar points allows a very accurate understanding of the environment.
Beyond the sheer amount of additional data when fusing camera and lidar information, there are more challenges. First of all, the model has to be multi-modal in the sense that it must handle different types of data (images and point clouds). Luckily, multi-modality has recently become a huge thing, and many recent ML models are multi-modal between text and images. Now we are just lacking a third modality, the point cloud. But there are some details to this fusion that the model simply has to get right, one way or another, in order to be safe for AD: while all pixels in a camera image are recorded at the same point in time, a lidar point cloud is collected one point at a time by sweeping over the world following some scanning pattern (which can differ between lidar models). The net effect is that the timestamp of each lidar point is different, and practically none of them matches the timestamp of the camera image exactly. It may seem like an overly technical detail, but when both you and an approaching vehicle are moving at 100 km/h, being a few milliseconds off when interpreting its trajectory might lead to catastrophic results… And mounting more lidar sensors on the car won’t make the fusion any less complicated.
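To make the timing issue concrete, here is a simplified sketch (our own illustration, with made-up function names and a constant-velocity assumption, not how any production fusion stack works) of projecting lidar points into a camera image: each point first has to be motion-compensated from its own timestamp to the camera’s exposure time before the usual calibration matrices can be applied.

```python
import numpy as np


def motion_compensate(points, point_times, cam_time, ego_velocity):
    """Shift each lidar point (N, 3) to where it would be at the camera timestamp.

    Simplifying assumption: the ego vehicle moves with constant velocity and the
    world is static; a real system also needs rotation and per-object motion.
    """
    dt = cam_time - point_times                 # seconds, one value per point
    return points - ego_velocity * dt[:, None]  # the ego moved, so the points shift back


def project_to_image(points, K, T_cam_from_lidar):
    """Project 3D lidar points (N, 3) to pixels using camera intrinsics K (3, 3)
    and the extrinsic lidar-to-camera transform (4, 4)."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
    cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]             # points in the camera frame
    cam = cam[cam[:, 2] > 0]                                      # keep points in front of the camera
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]                               # pixel coordinates (u, v)


# Why milliseconds matter: at 100 km/h (~27.8 m/s), a 10 ms timestamp error
# already corresponds to roughly 0.3 m of position error.
print(f"{27.8 * 0.010:.2f} m")
```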
None of these aspects seems fundamentally impossible to solve for an advanced ML model. But, to the best of our understanding, there is quite some work to be done before an ML model handles sensor fusion between sequences of camera images and lidar point clouds with the same excellence as ChatGPT handles text.
Things can happen fast in traffic. To get some ballpark numbers, let’s consider a scenario defined by EuroNCAP, where a child runs out into the street from behind some parked cars, as in the picture below.
From https://www.euroncap.com/media/79884/euro-ncap-aeb-lss-vru-test-protocol-v45.pdf
The scenario is designed such that if the car under test does not brake, the pedestrian will be hit at the center of the car’s front. The speed of the pedestrian is 5 km/h (1.4 m/s). The pedestrian is a child, and only very small parts of it are visible above the hood of the parked car. This means that the time from when the pedestrian is fully clear of the parked car until the collision would occur (if no braking is applied) is about 1.5 seconds.
Since the laws of physics are non-negotiable, most of those 1.5 seconds have to be spent on braking and/or steering, whereas the detection and decision have to happen within fractions of a second. Every time. And even though ML models may seem relatively fast - the processing of a single image with the SAM model is on the order of tenths of a second on a high-end GPU (assuming you can afford to put one in a car) - the margins are quickly gone if you, for example, first need to process several such images in sequence before you can proceed to simulating and comparing future trajectories, etc. It is probably not sufficient for an ML model to be relatively fast; it needs to be blazingly fast (let’s not even mention the responsiveness of ChatGPT).
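As a back-of-the-envelope illustration (the 30 km/h speed and 8 m/s² deceleration are our own assumed numbers, not part of the EuroNCAP protocol), subtracting the time physics demands for braking from the 1.5-second window shows how little is left for perception and decision-making:

```python
# All numbers except the 1.5 s window are illustrative assumptions.
time_budget_s = 1.5          # from the EuroNCAP-style scenario above
ego_speed_ms = 30 / 3.6      # assume the car drives at 30 km/h (~8.3 m/s)
deceleration = 8.0           # assume ~8 m/s^2, close to a full emergency brake

braking_time_s = ego_speed_ms / deceleration          # ~1.04 s just to come to a stop
perception_budget_s = time_budget_s - braking_time_s  # ~0.46 s left for sensing + deciding

print(f"braking needs {braking_time_s:.2f} s, leaving {perception_budget_s:.2f} s")
# With per-image inference on the order of tenths of a second, processing even a
# few images in sequence consumes most of that remaining budget.
```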
There is every reason to expect that computers will be faster in the future. There are also developments in the ML models themselves that make them faster. As long as those improvements are not eaten up by the added complexity of the ML models, this might not be an issue in the future. But we are not there yet.
Most of the time, a response from ChatGPT will not cause physical harm to anyone. For a model powering an AD vehicle, the situation is the opposite - most of the time, a misleading response can lead to material damage (best case) or a painful death (worst case).
AD is by no means the only application of ML where lives are at stake; medical applications are another example. However, many medical applications are decision support - like screening tons of X-rays. To introduce some numbers, let’s say there is 1 severe medical condition among 1000 X-rays. If we cannot trust the ML model to rank that specific X-ray as the most critical one, but we do know for sure that it will be among the “top-100 most critical X-rays”, the model is still not useless. In fact, it reduces the need for manual inspection from 1000 to 100 X-rays, that is, a reduction of 90%. Here, a missed detection is fatal, whereas a false detection is mostly an inconvenience.
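As a toy illustration of that triage arithmetic (random scores, not a real screening model), ranking 1000 cases by model score and manually reviewing only the top 100 cuts the workload by 90%, as long as the critical case reliably lands in that top slice:

```python
import random

random.seed(0)
n_xrays, n_reviewed = 1000, 100

# Toy scores: one truly critical case that the model ranks highly (but not
# necessarily first) and 999 benign cases with lower scores on average.
cases = [("critical", 0.97)] + [("benign", random.random()) for _ in range(n_xrays - 1)]

top = sorted(cases, key=lambda c: c[1], reverse=True)[:n_reviewed]
print(any(label == "critical" for label, _ in top))                 # True: the case gets a manual look
print(f"manual workload reduced by {1 - n_reviewed / n_xrays:.0%}")  # 90%
```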
Historically, decision support in automotive has had the other perspective: since features like auto-brake don’t take any responsibility away from the driver, and the baseline has been “compared to a manual driver”, a false intervention (like auto-braking because the car hallucinated a pedestrian) has been considered more severe than a missed intervention (since the human driver always remained responsible).
With full autonomy, the picture is different. AD cannot really afford either missed or false detections/interventions. In any situation. Achieving that level of performance uniformly across all weather and traffic situations is a real challenge, also for the current state-of-the-art ML.
Unfortunately, the recent ML advances do not seem to be a silver bullet for AD. At least not yet. Beyond the (kind of) hard facts that we have discussed above, one could also raise concerns about how transparency, explainability and accountability degrade as ML models grow larger. That is, however, a topic for another discussion. On the other hand, there is something fundamental that at least none of us anticipated a few years ago - the impressive ability of a large language model when it comes to human-like reasoning. One of the main challenges with AD is the long tail of unexpected events that it has to handle - like a fighter jet making an emergency landing in the opposite highway lane at the same time as you are being overtaken on the hard shoulder. Maybe a future cousin of ChatGPT can come in handy here, to make sure we handle an unforeseen situation well beyond the intended operational domain in a reasonably safe way?