Get the Most Annotated Data for Your Budget – Why Traditional Data Sourcing Is Holding Back Your ML Development
On September 22nd, Tom presented his session "Scrap that RfQ – Why your data sourcing strategy is killing your ML development" at AutoAI Europe in Berlin, inviting the audience to re-think how they source and process autonomy data.
In the race to develop autonomous vehicles, many teams assume that more data and better quality automatically lead to better machine learning models. While volume and quality matter, this overlooks a critical reality: ML development is non-linear, and your data needs evolve constantly. Traditional automotive procurement—focused on locking in large volumes at the lowest unit price—creates a detrimental lock-in that prevents teams from adapting. The result? You might save pennies per annotation, but lose significant time and budget on your overall development effort.
Data Is Dynamic: Iteration Beats Volume
Developing ML-based systems requires continuous adaptation. Models built today need to respond to tomorrow's changes—new requirements, edge cases, sensor configurations, and road conditions. In autonomous vehicle development, your reality is constantly shifting: new objects appear, suppliers discontinue camera types, vehicle designs force new sensor placements. Datasets must evolve alongside model development, not remain static assets purchased upfront.
Simply adding more data doesn't guarantee improvement—just as adding millions of lines of code doesn't automatically improve software. What drives ML performance is continuous refinement of requirements, collection campaigns, processing guidelines, and quality KPIs. Without iteration, the value of your datasets quickly diminishes. If data eats models for breakfast, then continuous iteration eats volume for lunch.
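That refinement loop can be made concrete with a short sketch: train, evaluate, analyze where the model falls short, order a small targeted annotation batch, and repeat. Everything below is illustrative, not a description of any specific tooling; the stand-in scoring function simply rewards iteration-targeted data over raw volume.

```python
def train_and_evaluate(dataset):
    """Stand-in for a real training run: the score rewards
    iteration-targeted samples far more than raw volume."""
    targeted = sum(1 for sample in dataset if sample["targeted"])
    return min(1.0, 0.5 + 0.002 * targeted + 0.0001 * len(dataset))

def refine_requirements(dataset, score):
    """Placeholder for failure analysis: decide what the next
    collection/annotation batch should focus on (e.g. night scenes)."""
    return {"focus": "edge_cases", "batch_size": 200}

def annotate_batch(requirements):
    """Placeholder for a flexible annotation provider: small batches
    ordered per iteration instead of one large upfront contract."""
    return [{"targeted": True} for _ in range(requirements["batch_size"])]

def iterate_dataset(target_score=0.95, max_iterations=10):
    """Iterate until the model KPI is met, growing the dataset
    in small, requirement-driven increments."""
    dataset, score = [], 0.0
    for _ in range(max_iterations):
        score = train_and_evaluate(dataset)
        if score >= target_score:
            break
        requirements = refine_requirements(dataset, score)
        dataset += annotate_batch(requirements)
    return score, len(dataset)
```

The point of the sketch is the control flow, not the numbers: annotation spend is committed one small batch at a time, each batch shaped by the latest failure analysis, which is exactly what a rigid upfront contract rules out.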
Quality Is Complex, Not Absolute
It's tempting to believe that "high quality data" solves the limitations of volume-based approaches. However, quality faces the same challenge: 100% quality is unattainable, and each additional nine on the way to 99.999% multiplies the cost. Worse, pixel-perfect annotations might actually hurt your model's ability to handle real-world noise.
Quality is multi-dimensional—it's not just about bounding box accuracy. Your quality needs on day zero will differ from what matters six months later. What improved model performance yesterday may have diminishing returns today. The assumption that "more" or "better" data translates to better performance isn't wrong—but traditional sourcing overlooks that "better" cannot be defined two years upfront for non-linear development.
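One way to make "quality is multi-dimensional" concrete is a weighted score over several quality axes whose weights are revisited each iteration. The dimensions and weights below are illustrative assumptions, not an industry standard; real programs define their own axes and audit processes.

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    """Illustrative quality dimensions for an annotation batch,
    each scored in [0, 1]."""
    geometric_accuracy: float   # e.g. bounding-box overlap vs. audit labels
    class_correctness: float    # right label on the right object
    completeness: float         # fraction of relevant objects annotated
    consistency: float          # agreement across annotators and frames

def quality_score(report: QualityReport, weights: dict) -> float:
    """Weighted quality KPI; the weights are re-tuned as the model's
    needs evolve, rather than frozen in a two-year contract."""
    total = sum(weights.values())
    return sum(getattr(report, dim) * w for dim, w in weights.items()) / total

# Early on, geometric accuracy may dominate the KPI ...
day_zero = {"geometric_accuracy": 0.5, "class_correctness": 0.2,
            "completeness": 0.2, "consistency": 0.1}
# ... while months later, missed objects may hurt validation far more.
month_six = {"geometric_accuracy": 0.2, "class_correctness": 0.2,
             "completeness": 0.4, "consistency": 0.2}

batch = QualityReport(0.95, 0.99, 0.85, 0.90)
```

With these hypothetical weights, the very same batch scores differently on day zero than at month six, which is the practical meaning of "what improved model performance yesterday may have diminishing returns today."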
Why Traditional Procurement Fails for ML Data
Automotive procurement prioritizes low unit prices and predictable costs over long contract periods. This approach, honed over decades of manufacturing success, uses RfQs to lock in fixed conditions through detailed requirements and quality KPIs. But for ML development, these assumptions don't hold: teams end up with rigid datasets that can't adapt to evolving needs, ultimately increasing total development cost and time-to-deployment.
It's crucial to remember: datasets for training and validation aren't the end goal. The goal is a model that meets performance KPIs and can be deployed on the road within time and budget. The real deliverable is annotated data that accelerates machine learning—not just the cheapest labels per unit.
A Smarter Sourcing Strategy: Flexibility and Integration
The answer lies in flexibility and integration. Instead of committing to large, rigid contracts, your sourcing strategy should be agile. When selecting an annotation provider, prioritize their ability to adapt to changes in data volume and quality requirements over the lowest possible unit cost. You wouldn't choose a software vendor based on price per line of code—the same principle applies to annotation.
Equally important is investing in infrastructure, processes, and tools that enable iterative data collection, processing, and refinement. Your data strategy must be fully integrated with model development, not treated as a separate upstream task. This approach ensures you get the most annotated data for your budget—because you're continuously refining what "useful" means as your models evolve.
Conclusion: Machines Learn Faster with Human Feedback
As autonomous vehicle development accelerates, your approach to data must evolve with it. The most successful ML models won't come from teams that process the most data, but from those that continuously refine and iterate on relevant datasets with cost-efficient human feedback integrated throughout their development pipeline.
The days of linear, volume-based data sourcing are over for teams serious about delivering ADAS/AD models within budget. To stay ahead, embrace data as a dynamic, evolving resource—and choose partners who can deliver the most annotated autonomy data per dollar, adapting alongside your needs. Just as we dropped waterfall in traditional software development, it's time to bring that same iterative mindset to how we source and process autonomy data.