Autonomous Vehicle Company Wayve Ends GPU Scheduling ‘Horror’
Technology Category
- Analytics & Modeling - Machine Learning
Applicable Industries
- Automotive
Applicable Functions
- Product Research & Development
- Discrete Manufacturing
Use Cases
- Machine Condition Monitoring
- Autonomous Transport Systems
Services
- Cloud Planning, Design & Implementation Services
- Data Science Services
The Challenge
Wayve, a London-based company developing artificial intelligence software for self-driving cars, faced a significant challenge with its GPU resources. Its Fleet Learning Loop, a continuous cycle of data collection, curation, model training, re-simulation, and licensing of models before deployment into the fleet, consumed a large amount of GPU capacity. Yet although nearly 100 percent of GPU resources were allocated to researchers, less than 45 percent were actually utilized. The cause was static assignment: each GPU was reserved for a specific researcher, so when researchers were not using their assigned GPUs, no one else could access them. This created the illusion that the GPUs for model training were at capacity even as many sat idle.
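The gap between allocation and utilization described above can be illustrated with a small calculation. This is a minimal sketch only: the `Researcher` class, the cluster size, and every number in it are hypothetical values chosen for illustration, not Wayve's actual data.

```python
from dataclasses import dataclass


@dataclass
class Researcher:
    name: str
    assigned_gpus: int  # GPUs statically reserved for this researcher
    busy_gpus: int      # GPUs actually running work right now


def allocation_rate(researchers, cluster_size):
    """Fraction of the cluster that is reserved, whether used or not."""
    return sum(r.assigned_gpus for r in researchers) / cluster_size


def utilization_rate(researchers, cluster_size):
    """Fraction of the cluster that is doing real work."""
    return sum(r.busy_gpus for r in researchers) / cluster_size


# Hypothetical 16-GPU cluster, fully carved up among four researchers.
CLUSTER_SIZE = 16
team = [
    Researcher("researcher-a", assigned_gpus=4, busy_gpus=4),
    Researcher("researcher-b", assigned_gpus=4, busy_gpus=0),
    Researcher("researcher-c", assigned_gpus=4, busy_gpus=1),
    Researcher("researcher-d", assigned_gpus=4, busy_gpus=2),
]

allocated = allocation_rate(team, CLUSTER_SIZE)
utilized = utilization_rate(team, CLUSTER_SIZE)
# GPUs that are idle but cannot be used by anyone else under static assignment.
idle_but_reserved = sum(r.assigned_gpus - r.busy_gpus for r in team)

print(f"allocated: {allocated:.0%}, utilized: {utilized:.1%}, "
      f"idle but reserved: {idle_but_reserved} GPUs")
```

In this made-up example the cluster looks full (every GPU is reserved) while most of it does no work, which is the same pattern the case study reports.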
About The Customer
Wayve is a London-based company developing artificial intelligence software for self-driving cars. The company's unique approach to autonomous driving technology does not rely on expensive sensing equipment. Instead, Wayve focuses on developing greater intelligence for better autonomous driving in dense urban areas. The company's primary GPU compute consumption comes from Fleet Learning Loop production training: it trains the product baseline on the full dataset over many epochs and continually retrains as new data is collected through iterations of the Fleet Learning Loop.
The Solution
Wayve turned to Run:ai to solve its GPU resource and scheduling issues. Run:ai implemented a system that removed silos and eliminated static allocation of resources. It created pools of shared GPUs, allowing teams to access more GPUs, run more workloads, and increase productivity. Wayve researchers submit jobs to the system every day, regardless of team, and the Run:ai system queues the jobs and launches them automatically when GPUs become available. Run:ai’s dedicated batch scheduler, running on Kubernetes, provides features crucial for managing deep learning (DL) workloads, such as advanced queuing and quotas, priorities and policies, automatic preemption, and multi-node training. The result was cluster utilization of over 80 percent and a significant increase in the number of jobs running.
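The idea of a shared pool with automatic queuing can be sketched in a few lines. This is a toy illustration under stated assumptions, not Run:ai's scheduler or API: the `SharedGpuPool` class, its methods, the job names, and the priority scheme are all hypothetical, and features such as quotas, preemption, and multi-node placement are left out.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Job:
    priority: int                     # lower value is launched first
    seq: int                          # submission order, breaks ties FIFO
    name: str = field(compare=False)
    gpus: int = field(compare=False)


class SharedGpuPool:
    """Hypothetical pool: any job may use any free GPU, regardless of team."""

    def __init__(self, total_gpus):
        self.total = total_gpus
        self.free = total_gpus
        self.queue = []    # min-heap of waiting jobs
        self.running = {}  # job name -> GPUs held
        self._seq = count()

    def submit(self, name, gpus, priority=10):
        if gpus > self.total:
            raise ValueError(f"{name} requests more GPUs than the pool has")
        heapq.heappush(self.queue, Job(priority, next(self._seq), name, gpus))
        self._launch_ready()

    def finish(self, name):
        # Return the job's GPUs to the pool, then start whatever now fits.
        self.free += self.running.pop(name)
        self._launch_ready()

    def _launch_ready(self):
        # Strict priority order: launch the head of the queue while it fits,
        # and stop at the first job that does not (no backfilling).
        while self.queue and self.queue[0].gpus <= self.free:
            job = heapq.heappop(self.queue)
            self.free -= job.gpus
            self.running[job.name] = job.gpus

    def utilization(self):
        return (self.total - self.free) / self.total


pool = SharedGpuPool(total_gpus=8)
pool.submit("train-a", gpus=4)
pool.submit("train-b", gpus=4)
pool.submit("eval-c", gpus=2)                 # pool is full, so this waits
pool.submit("urgent-d", gpus=2, priority=1)   # waits, but ahead of eval-c
pool.finish("train-a")                        # frees 4 GPUs; both waiting jobs launch
print(sorted(pool.running), pool.utilization())
```

The design choice worth noting is that GPUs belong to the pool rather than to a person: a finished job immediately hands its GPUs to whichever queued job is next, so idle capacity does not stay locked behind a reservation.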
Operational Impact
Quantitative Benefit
Related Case Studies
Case Study
Integral Plant Maintenance
Mercedes-Benz and its partner GAZ chose Siemens as their maintenance partner at a new engine plant in Yaroslavl, Russia. The new plant has the capacity to manufacture diesel engines for the locally produced Sprinter Classic for the Russian market. In addition to engines for the local market, the Yaroslavl plant will also produce spare parts. Mercedes-Benz Russia and its partner needed a service partner to ensure the operation of these lines under a maintenance partnership arrangement. The challenges included coordinating the entire maintenance management operation, in particular inspections, corrective and predictive maintenance activities, and optimizing spare parts management. Siemens developed a customized maintenance solution that includes all electronic and mechanical maintenance activities (Integral Plant Maintenance).
Case Study
Monitoring of Pressure Pumps in Automotive Industry
A large German/American producer of auto parts uses high-pressure pumps to deburr machined parts as part of its production and quality check process. The company decided to monitor these pumps to make sure they work properly and to detect any indications of a potential failure before it affects the process.