Autonomous Vehicle Company Wayve Ends GPU Scheduling ‘Horror’

Technology Category
  • Analytics & Modeling - Machine Learning
  • Application Infrastructure & Middleware - Data Exchange & Integration
Applicable Industries
  • Automotive
Applicable Functions
  • Discrete Manufacturing
  • Product Research & Development
Use Cases
  • Autonomous Transport Systems
  • Machine Condition Monitoring
Services
  • Cloud Planning, Design & Implementation Services
  • Data Science Services
The Challenge
Wayve, a London-based company developing artificial intelligence software for self-driving cars, faced a significant challenge with its GPU resources. Its Fleet Learning Loop, a continuous cycle of data collection, curation, model training, re-simulation, and licensing of models before deployment into the fleet, consumed a large amount of GPU compute. Yet although nearly 100 percent of GPU resources were allocated to researchers, less than 45 percent were actually utilized. GPUs were statically assigned to individual researchers, so when a researcher's assigned GPUs were idle, no one else could access them. This created the illusion that GPU capacity for model training was exhausted even while many GPUs sat idle.
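To see why static allocation wastes capacity, consider a toy simulation (all numbers are hypothetical, not Wayve's actual figures): under per-researcher quotas, one researcher's idle GPUs cannot absorb another researcher's burst of demand, while a shared pool can.

```python
import random

random.seed(42)

NUM_GPUS = 16
NUM_RESEARCHERS = 4
PER_RESEARCHER = NUM_GPUS // NUM_RESEARCHERS  # static quota: 4 GPUs each
HOURS = 10_000

static_used = pooled_used = 0
for _ in range(HOURS):
    # Hypothetical bursty demand: each researcher wants 0-8 GPUs this hour.
    demands = [random.randint(0, 8) for _ in range(NUM_RESEARCHERS)]
    # Static: each researcher is capped at their own quota; a colleague's
    # idle quota is wasted even when demand elsewhere exceeds it.
    static_used += sum(min(d, PER_RESEARCHER) for d in demands)
    # Pooled: all demand is served from the shared pool up to cluster capacity.
    pooled_used += min(sum(demands), NUM_GPUS)

print(f"static utilization: {static_used / (HOURS * NUM_GPUS):.0%}")
print(f"pooled utilization: {pooled_used / (HOURS * NUM_GPUS):.0%}")
```

With the same demand, the pooled cluster always serves at least as many GPU-hours as the statically partitioned one, which is the gap Wayve was seeing between allocation and utilization.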
About The Customer
Wayve is a London-based company that is developing artificial intelligence software for self-driving cars. The company's approach to autonomous driving technology does not rely on expensive sensing equipment; instead, Wayve focuses on developing greater intelligence for better autonomous driving in dense urban areas. The company's primary GPU compute consumption comes from Fleet Learning Loop production training. Wayve trains the product baseline with the full dataset over many epochs and continually re-trains as new data is collected through iterations of the Fleet Learning Loop.
The Solution
Wayve turned to Run:ai to resolve its GPU resourcing and scheduling issues. Run:ai removed silos and eliminated static allocation of resources by creating pools of shared GPUs, allowing teams to access more GPUs, run more workloads, and increase productivity. Wayve researchers submit jobs to the system every day, regardless of team, and the Run:ai system queues them and launches them automatically as GPUs become available. Run:ai's dedicated batch scheduler, running on Kubernetes, provides features crucial to managing deep learning workloads: advanced queuing and quotas, priority and policy management, automatic preemption, multi-node training, and more. The result was cluster utilization of over 80% and a significant increase in the number of jobs running.
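The queue-and-launch behavior described above can be sketched in a few lines of Python. This is a toy illustration, not Run:ai's actual scheduler; the job names and GPU counts are made up:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                     # lower value = higher priority
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class GpuScheduler:
    """Toy queue-and-launch scheduling over a shared GPU pool."""

    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.queue: list[Job] = []    # priority heap of waiting jobs
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.queue, job)
        self._launch()

    def finish(self, job: Job) -> None:
        self.running.remove(job)
        self.free += job.gpus         # freed GPUs trigger the next launch
        self._launch()

    def _launch(self) -> None:
        # Launch queued jobs in priority order while GPUs are available.
        while self.queue and self.queue[0].gpus <= self.free:
            job = heapq.heappop(self.queue)
            self.free -= job.gpus
            self.running.append(job)

sched = GpuScheduler(total_gpus=8)
sched.submit(Job(1, "train-baseline", 6))
sched.submit(Job(2, "resim", 4))          # queued: only 2 GPUs free
print([j.name for j in sched.running])    # ['train-baseline']
sched.finish(sched.running[0])            # freeing GPUs auto-launches 'resim'
print([j.name for j in sched.running])    # ['resim']
```

A production scheduler like Run:ai's layers quotas, policies, and preemption on top of this basic loop, but the core idea is the same: jobs wait in a queue rather than holding GPUs idle, and capacity is reassigned the moment it frees up.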
Operational Impact
  • Wayve's GPU utilization increased from less than 45% to over 80%.
  • The number of jobs running on Wayve's system increased significantly.
  • Wayve's teams were able to access more GPUs and run more workloads, increasing overall productivity.
Quantitative Benefit
  • Increase in GPU utilization from less than 45% to over 80%.
  • Significant increase in the number of jobs running on the system.
  • Increased access to GPUs for teams, leading to increased productivity.
