An Iterative Approach to Datasets: The Fuel Behind Successful ADAS/AD

Most Artificial Intelligence (AI) products are, in essence, supervised learning systems. In the development of automotive products, AI teams are creating new types of learning systems by annotating datasets from multiple sensors such as LiDAR, radar, and cameras.

Consequently, proprietary data has become a huge differentiator for Advanced Driver Assistance and Autonomous Driving systems (ADAS/AD). In a market where all competitors have access to the same algorithms, product success depends largely on the quality of datasets that merge sensor data of ever-increasing size and complexity.

To unlock greater efficiencies and significantly improve model performance and safety, auto manufacturers need to understand how to effectively produce and manage datasets composed of complex objects and sequences. The rapidly evolving landscape of sensor hardware makes managing this “sensor-fusion” data increasingly complex.

[Figure: 3D semantic segmentation]

The quality of datasets is crucial to the success of automotive products, because models are only as good as the data they are trained on. Achieving high-quality datasets is not a one-time task; it requires an iterative process. A developer wouldn’t attempt to write 20 million lines of code all at once, because the objective of that code will continuously evolve. Similarly, the datasets that power ADAS/AD products need to be developed and refined over time. Unfortunately, most auto manufacturers still view data as a fixed asset, which can result in suboptimal performance and safety.

High-quality data directly impacts model performance, generalization, bias, robustness, interpretability, and overall efficiency in real-world ADAS/AD applications. To improve the safety of autonomous vehicles, this data needs to be fine-tuned and integrated with human recommendations on what can be improved, addressing questions such as: What objects are throwing off performance? Other cars? Pedestrians? Reflections? Stationary objects? Addressing these questions is the right approach for automotive manufacturers who want to efficiently train their ML models with sensor-fusion data.
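To make these questions concrete, a common starting point is a per-class breakdown of the model’s false positives and false negatives. Below is a minimal sketch in Python, assuming an evaluation step has already matched predictions to ground truth; the field names (predictions, ground_truth, category, matched) are illustrative, not a specific tool’s schema.

```python
from collections import Counter

def error_breakdown(frames):
    """Count false positives and false negatives per object class.

    `frames` is assumed to be an iterable of dicts with `predictions` and
    `ground_truth` lists; each entry carries a `category` and a `matched`
    flag produced by whatever matching step the evaluation pipeline uses.
    All names here are illustrative assumptions.
    """
    false_positives, false_negatives = Counter(), Counter()
    for frame in frames:
        for pred in frame["predictions"]:
            if not pred["matched"]:            # detection with no ground-truth match
                false_positives[pred["category"]] += 1
        for gt in frame["ground_truth"]:
            if not gt["matched"]:              # ground-truth object the model missed
                false_negatives[gt["category"]] += 1
    return false_positives, false_negatives
```

Sorting these counters immediately shows whether reflections, parked cars, or pedestrians dominate the error budget, and therefore where the next round of data collection or re-annotation should focus.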

Optimizing the Feedback Loops

Once the dataset is deployed into training, the feedback loop between dataset assessment and model performance drives an iterative process that is incredibly valuable to ADAS/AD product success. Tooling that reveals anomalies and improves the data, or adds new data where needed, can enhance the model’s capabilities in ways that can be verified against model metrics. Taking an iterative approach to datasets also helps prevent model bias, as data augmentation techniques can help the model differentiate between objects more accurately.
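The shape of that loop can be stated in a few lines. The following is a minimal sketch, assuming the dataset is a simple list of samples and that the team’s own pipeline supplies the train, evaluate, and select_additions callables; these interfaces are hypothetical, not a specific product’s API.

```python
def iterate_dataset(dataset, train, evaluate, select_additions,
                    max_rounds=5, target_score=0.90):
    """One way to structure the dataset/model feedback loop.

    Assumed (hypothetical) interfaces:
      train(dataset) -> model
      evaluate(model, dataset) -> (score, failure_report)
      select_additions(failure_report) -> list of new or re-annotated samples
    """
    history = []
    model = None
    for round_id in range(max_rounds):
        model = train(dataset)
        score, failures = evaluate(model, dataset)
        history.append((round_id, score, len(dataset)))
        if score >= target_score or not failures:
            break
        # Grow the dataset only where the evaluation says the model is weak.
        dataset = dataset + select_additions(failures)
    return model, history
```

The essential point is that new or corrected data enters the dataset only in response to observed failures, rather than through bulk collection.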

As an example, let’s consider pictures of pedestrians on billboards. While it might be obvious to a human observer that these are not real pedestrians, it may not be clear to the “machine.” Humans need to determine how best to handle this type of situation in order to eliminate ambiguity. One possible approach is to leave those objects unannotated and accept that the model may detect them as pedestrians. Another is to add annotations for them in your dataset, to avoid “punishing” your model for detecting them in the first place. Either way, these types of situations demand that an iterative cycle be enabled on your dataset.
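One lightweight way to keep both options open is to record the ambiguity explicitly as an annotation attribute and decide how to treat it at training time. The sketch below assumes annotations are plain dicts; the is_depiction flag is a hypothetical attribute that a labeling guideline might define for pedestrians appearing on billboards, posters, or advertisements.

```python
def split_depictions(annotations):
    """Separate real pedestrians from depicted ones so each group can be
    handled explicitly (kept, ignored, or down-weighted) during training.

    `is_depiction` is a hypothetical guideline attribute, not a standard field.
    """
    real, depicted = [], []
    for ann in annotations:
        if ann.get("attributes", {}).get("is_depiction", False):
            depicted.append(ann)
        else:
            real.append(ann)
    return real, depicted
```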

[Figure: Platform view of pedestrians]

When it comes to overall sizing, increasing dataset size only where it has a positive impact (rather than adding more of the data you already have) offers more diverse examples for better model training. Many OEMs and Tier 1/Tier 2 manufacturers still capture kilometer after kilometer of highway data, but when the bulk of this coverage doesn’t contain the rare occurrences that will improve the ML model, greater size does not equate to better results.
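One way to add data only where it helps is to compare incoming samples against what the dataset already covers, for example in an embedding space, and keep only the samples that are genuinely new. Here is a minimal sketch, assuming unit-normalized embeddings from any scene or image encoder; the cosine-distance threshold is illustrative.

```python
import numpy as np

def select_novel_samples(candidate_embeddings, dataset_embeddings, min_distance=0.5):
    """Return indices of candidates that are sufficiently far, in embedding
    space, from what the dataset already contains."""
    reference = np.asarray(dataset_embeddings, dtype=float)
    selected = []
    for idx, emb in enumerate(candidate_embeddings):
        emb = np.asarray(emb, dtype=float)
        if reference.size:
            similarity = reference @ emb              # cosine similarity for unit vectors
            if (1.0 - similarity.max()) < min_distance:
                continue                              # near-duplicate of existing data, skip
        selected.append(idx)
        # Keep the accepted sample so later candidates are compared against it too.
        reference = emb[None, :] if not reference.size else np.vstack([reference, emb])
    return selected
```

Under this kind of filter, another thousand kilometers of near-identical highway frames adds almost nothing, while a single rare scene passes through.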

Development, Production and Beyond

[Figure: The datasets process]

Here are three tips for AI product teams that want to improve the success of their ML model by implementing iterative datasets during development, production and beyond:

  1. Configure your tools and set your guidelines well in advance. The journey toward creating high-quality datasets begins well before production: tool configuration and guideline setting are essential for minimizing the ambiguity that can lead to early (and costly) errors. For example, restricting cuboid orientation, unless it is explicitly needed, can reduce mistakes. Annotator agreement tests, where multiple annotators label the same data, help create metrics on guideline thresholds, often visualized with heatmaps. And adding automatic sanity “checkers” can detect anomalies and raise warnings and errors before submission, ensuring annotation quality (a minimal checker sketch follows this list).
  2. Implement continuous quality control during production. Sampling tasks for quality assurance allows for statistical quality estimates, enabling early issue detection and cost reduction. Monitoring data coverage provides a summary of the dataset’s contents and aids data selection. Tracking annotator errors and reviewing rejections throughout production then keeps dataset quality at a high bar.
  3. Monitor workforce performance. In any annotation effort, some annotators leave while new ones are onboarded. Teams also typically find issues with sensor calibration that affect how certain edge cases should be annotated. Continuous performance monitoring can help identify and correct potential problems before they take hold.
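As a concrete illustration of the automatic sanity “checkers” mentioned in the first tip, here is a minimal sketch of pre-submission checks for cuboid annotations. The field names and the plausible-size table are assumptions made for illustration, not a specific annotation tool’s schema.

```python
import math

# Illustrative per-class size limits in meters:
# (min_length, max_length, min_width, max_width, min_height, max_height)
PLAUSIBLE_SIZES_M = {
    "pedestrian": (0.2, 1.5, 0.2, 1.5, 0.8, 2.5),
    "car":        (2.5, 6.5, 1.4, 2.5, 1.0, 2.5),
}

def check_cuboid(ann):
    """Return a list of warnings for a single cuboid annotation dict."""
    warnings = []
    limits = PLAUSIBLE_SIZES_M.get(ann.get("category"))
    if limits:
        lo_l, hi_l, lo_w, hi_w, lo_h, hi_h = limits
        if not (lo_l <= ann["length"] <= hi_l and
                lo_w <= ann["width"] <= hi_w and
                lo_h <= ann["height"] <= hi_h):
            warnings.append(f"{ann['category']} cuboid has implausible dimensions")
    if ann.get("yaw") is not None and abs(ann["yaw"]) > math.pi:
        warnings.append("yaw is outside the expected [-pi, pi] range")
    return warnings
```

Running such checks at submission time turns silent annotation drift into immediate, reviewable warnings.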

Treating machine learning datasets as iterative, evolving assets is key to success in AI, especially in ADAS/AD products. Organizations that invest in high-quality data and in continuous, iterative collection, annotation, and verification will win the race to deliver impactful results in diverse real-world scenarios.