On September 22nd, Tom presented his session "Scrap that RfQ – Why your data sourcing strategy is killing your ML development" at AutoAI Europe in Berlin, inviting the audience to rethink the role of data as an asset.
In the race to develop autonomous vehicles, one of the most common misconceptions is that more and better data automatically leads to better machine learning (ML) models. While it’s not wrong to think that bigger volumes and higher quality yield better performance, the truth is far more nuanced. Data isn’t an asset that yields linear improvement with scale; it’s a dynamic, evolving resource that requires continuous refinement. Relying on the traditional automotive recipe of ‘large volumes, low unit prices’ to source data can lead to a detrimental lock-in of your ML development. Let’s explore why.
Developing ML-based systems is not a linear endeavor. Models built today need to adapt to tomorrow’s changes: changes to requirements, conditions and best practices. This means that data collection and processing isn’t a one-time activity, but a flexible, ongoing process. In autonomous vehicle development, for instance, your reality is continually changing: new objects come into existence, new edge cases are discovered, a supplier may discontinue a camera type, vehicle design might force a new LiDAR placement… The world is ever-changing, and datasets need to keep up alongside model development.
So ‘more data’ is not a winning formula, even though the overall volumes required are of course large. Just as software does not automatically improve if you add 10 million lines of code, ML performance will not necessarily get better with millions more data labels; it requires continuous refinement of requirements, collection campaigns, processing guidelines and quality KPIs. Without such iteration, the value of your datasets decays quickly.
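The diminishing-returns argument can be made concrete with a toy model. Empirical scaling studies often describe test error as a power law in dataset size; the sketch below uses purely illustrative coefficients (not measurements from any real system) to show how each tenfold increase in labels buys less and less improvement.

```python
# Toy illustration of diminishing returns under a power-law learning curve:
# error(n) = a * n**-b + c, where c is the irreducible error floor.
# The coefficients a, b, c are illustrative assumptions, not measurements.

def error(n: float, a: float = 5.0, b: float = 0.3, c: float = 0.02) -> float:
    """Hypothetical test error after training on n labeled samples."""
    return a * n ** -b + c

for n in [10_000, 100_000, 1_000_000]:
    gain = error(n) - error(n * 10)  # improvement from 10x more labels
    print(f"{n:>9,} labels: error={error(n):.4f}, gain from 10x more={gain:.4f}")
```

Each decade of additional labels yields a smaller gain than the one before it, which is why a fixed-volume contract signed upfront tends to overpay for the last millions of annotations.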
At this point, it is worth reflecting on the value of data quality: it is easy to believe that ‘high quality data’ will fix the shortcomings of a volume-based bet. However, data quality faces the same challenge of complexity: 100% quality is impossible, and 99.999…% becomes infinitely expensive. It could even be detrimental, because your model might fail to handle noise if trained only on pixel-perfect annotations. Additionally, quality is defined along more axes than just the fit of bounding boxes. The reality is that quality, too, is complex. You won’t know your exact needs on day 0, and they will change over time: what moved the needle for your model’s performance yesterday will eventually have diminishing returns, and you risk burning a lot of money on ‘high quality data’ even after a particular KPI has lost value for ongoing development.
The assumption that ‘more data’ or ‘better data’ translates to better performance is not wrong in itself, but traditional data sourcing overlooks the critical fact that ‘better’ cannot be defined two years upfront for a non-linear development effort. If data eats models for breakfast, then continuous iteration eats volume for lunch.
Automotive procurement loves low unit prices and highly predictable costs. Based on what has served OEMs well over the last decades, RfQs focus on locking in fixed conditions over a long period of time. When it comes to buying data, many requirements and quality KPIs can be thought up and documented as part of a supplier contract, but reality shows that these assumptions do not hold. As a result, automakers may save one or two cents per annotation, but lose big bucks on their overall development effort. It is important to keep in mind that datasets for model training and validation are not 'the thing' we are after. The 'thing' is a model that meets performance KPIs and can be deployed on the road within time and budget.
So, how do you adapt? The answer lies in flexibility. Instead of committing to large, rigid datasets, your sourcing strategy should be more agile. When picking a provider, the ability to respond to changes in data, volume and quality KPIs should be valued over the lowest possible unit cost, just as one would not pick a software vendor based on the price per line of code. Equally important is the need to invest in infrastructure, processes and tools that allow for iterative data collection, processing and refinement: your data strategy needs to be fully integrated with your model development, rather than a standalone upstream effort. This approach ensures you will have access to the right data when changes inevitably happen.
As the landscape of autonomous vehicles and machine learning continues to evolve, so must your approach to data. The most successful ML models will come not from the teams that process the most data, but from those that continuously refine and iterate on relevant datasets. The days of linear data sourcing are over for anyone looking to deliver ADAS/AD models within budget. To stay ahead, we need to embrace the fact that datasets are a dynamic, evolving resource, and future-proof our strategies accordingly, just as we dropped waterfall in traditional software development.