Autonomous Vehicle Company Wayve Ends GPU Scheduling ‘Horror’

Technology Category
  • Analytics & Modeling - Machine Learning
  • Application Infrastructure & Middleware - Data Exchange & Integration
Applicable Industries
  • Automotive
Applicable Functions
  • Discrete Manufacturing
  • Product Research & Development
Use Cases
  • Autonomous Transport Systems
  • Machine Condition Monitoring
Services
  • Cloud Planning, Design & Implementation Services
  • Data Science Services
The Challenge
Wayve, a London-based company developing artificial intelligence software for self-driving cars, faced a significant challenge with its GPU resources. Its Fleet Learning Loop, a continuous cycle of data collection, curation, model training, re-simulation, and licensing of models before deployment into the fleet, consumed a large amount of GPU compute. Yet although nearly 100 percent of GPU resources were allocated to researchers, less than 45 percent were actually utilized. GPUs were statically assigned to individual researchers, so when a researcher's assigned GPUs were idle, no one else could access them. This created the illusion that GPU capacity for model training was exhausted even while many GPUs sat idle.
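To see why static allocation wastes capacity, consider a toy simulation (all numbers are hypothetical, not Wayve's actual figures): under per-researcher quotas, one researcher's idle GPUs cannot absorb another researcher's burst of demand, while a shared pool can.

```python
import random

random.seed(42)

NUM_GPUS = 16
NUM_RESEARCHERS = 4
PER_RESEARCHER = NUM_GPUS // NUM_RESEARCHERS  # static quota: 4 GPUs each
HOURS = 10_000

static_used = pooled_used = 0
for _ in range(HOURS):
    # Hypothetical bursty demand: each researcher wants 0-8 GPUs this hour.
    demands = [random.randint(0, 8) for _ in range(NUM_RESEARCHERS)]
    # Static: each researcher is capped at their own quota; a colleague's
    # idle quota is wasted even when demand elsewhere exceeds it.
    static_used += sum(min(d, PER_RESEARCHER) for d in demands)
    # Pooled: all demand is served from the shared pool up to cluster capacity.
    pooled_used += min(sum(demands), NUM_GPUS)

print(f"static utilization: {static_used / (HOURS * NUM_GPUS):.0%}")
print(f"pooled utilization: {pooled_used / (HOURS * NUM_GPUS):.0%}")
```

With the same demand, the pooled cluster always serves at least as many GPU-hours as the statically partitioned one, which is the gap Wayve was seeing between allocation and utilization.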
About The Customer
Wayve is a London-based company that is developing artificial intelligence software for self-driving cars. The company's approach to autonomous driving technology does not rely on expensive sensing equipment; instead, Wayve focuses on developing greater intelligence for better autonomous driving in dense urban areas. The company's primary GPU compute consumption comes from Fleet Learning Loop production training. Wayve trains the product baseline with the full dataset over many epochs and continually re-trains as new data is collected through iterations of the Fleet Learning Loop.
The Solution
Wayve turned to Run:ai to resolve its GPU resourcing and scheduling issues. Run:ai removed silos and eliminated static allocation of resources by creating pools of shared GPUs, allowing teams to access more GPUs, run more workloads, and increase productivity. Wayve researchers submit jobs to the system every day, regardless of team, and the Run:ai system queues them and launches them automatically as GPUs become available. Run:ai's dedicated batch scheduler, running on Kubernetes, provides features crucial to managing deep learning workloads: advanced queuing and quotas, priority and policy management, automatic preemption, multi-node training, and more. The result was cluster utilization of over 80% and a significant increase in the number of jobs running.
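The queue-and-launch behavior described above can be sketched in a few lines of Python. This is a toy illustration, not Run:ai's actual scheduler; the job names and GPU counts are made up:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                     # lower value = higher priority
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class GpuScheduler:
    """Toy queue-and-launch scheduling over a shared GPU pool."""

    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.queue: list[Job] = []    # priority heap of waiting jobs
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.queue, job)
        self._launch()

    def finish(self, job: Job) -> None:
        self.running.remove(job)
        self.free += job.gpus         # freed GPUs trigger the next launch
        self._launch()

    def _launch(self) -> None:
        # Launch queued jobs in priority order while GPUs are available.
        while self.queue and self.queue[0].gpus <= self.free:
            job = heapq.heappop(self.queue)
            self.free -= job.gpus
            self.running.append(job)

sched = GpuScheduler(total_gpus=8)
sched.submit(Job(1, "train-baseline", 6))
sched.submit(Job(2, "resim", 4))          # queued: only 2 GPUs free
print([j.name for j in sched.running])    # ['train-baseline']
sched.finish(sched.running[0])            # freeing GPUs auto-launches 'resim'
print([j.name for j in sched.running])    # ['resim']
```

A production scheduler like Run:ai's layers quotas, policies, and preemption on top of this basic loop, but the core idea is the same: jobs wait in a queue rather than holding GPUs idle, and capacity is reassigned the moment it frees up.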
Operational Impact
  • Wayve's GPU utilization increased from less than 45% to over 80%.
  • The number of jobs running on Wayve's system increased significantly.
  • Wayve's teams were able to access more GPUs and run more workloads, increasing overall productivity.
Quantitative Benefit
  • Increase in GPU utilization from less than 45% to over 80%.
  • Significant increase in the number of jobs running on the system.
  • Increased access to GPUs for teams, leading to increased productivity.
