
PBS Professional Manages Workload for NCI Raijin, Largest Supercomputer in Southern Hemisphere

Technology Category
  • Application Infrastructure & Middleware - Event-Driven Application
  • Sensors - Temperature Sensors
Applicable Industries
  • Construction & Infrastructure
  • Utilities
Applicable Functions
  • Maintenance
  • Product Research & Development
Use Cases
  • Construction Management
  • Inventory Management
Services
  • System Integration
  • Training
The Challenge
The National Computational Infrastructure (NCI) in Australia operates Raijin, the largest supercomputer in the Southern Hemisphere. The supercomputer handles a wide spectrum of job types that vary in scale and completion time. The challenge was to ensure overall system balance, scalable performance, and a high-quality user experience. To protect application performance, the architecture and subsystems needed to scale along with the software and hardware. NCI needed a highly scalable, flexible, and reliable product that could handle both the size and complexity of its computing requirements. NCI previously operated an in-house OpenPBS system with a locally customized scheduler and an associated accounting system to manage its resources. However, sustaining development and support for this system was becoming increasingly difficult, which led NCI to investigate new options for Raijin’s workload manager.
About The Customer
The National Computational Infrastructure (NCI) is Australia’s national research computing service. It provides world-class, high-end services to Australian researchers, including access to advanced computational and data-intensive methods, support, and high-performance infrastructure. NCI supports computationally based research, with a particular focus on the environment, climate and earth system science. Since 2007, NCI’s infrastructure investments, which exceed $80M, have been provided by the Australian Government under its National Collaborative Research Infrastructure Strategy (NCRIS) and Super Science Initiatives. NCI operates Raijin, the largest supercomputer in Australia and among the top 30 systems in the world.
The Solution
NCI conducted a full system “bake-off” between several workload management and cluster management products. After a rigorous selection process, NCI selected Altair’s PBS Professional as its workload management system, because it outperformed the competing products in both performance and flexibility. Altair was also required to provide a replacement grant management and accounting system that could integrate flexibly with both the Raijin system and other NCI resources. Altair and NCI developed this new accounting system cooperatively during the bake-off period. In addition, PBS Professional had to integrate with the OneSIS cluster management software that NCI chose to manage cluster nodes and other Fujitsu hardware. Altair Professional Services were engaged to write and integrate most of the basic replacement functionality, which was then tested under simulated load, systems management and component failure conditions to ensure the system would be viable for production.
Operational Impact
  • PBS Professional is now in production on Raijin’s 57,472 Intel Sandy Bridge cores, which are connected by a Mellanox FDR InfiniBand interconnect and served by a 9 PByte Lustre filesystem for scratch space. The system is regularly accessed by over 1000 users, whose applications span a very broad range of scientific areas and packages, including both open source and licensed products. PBS Professional manages the workload for these applications and, via its plugin extension system, provides functionality such as local and distributed job scratch spaces, software license management and resource placement. In addition, the new PBS Professional accounting system is being used to manage computational, storage and cloud resources across the entire NCI facility. PBS Professional has also been installed on NCI’s OpenStack Cloud system, where it will be available in the future for a broader range of use cases that may not fit the standard time-shared, centrally managed model of the current clusters.
Quantitative Benefit
  • Raijin, the supercomputer managed by PBS Professional, has 57,472 Intel Sandy Bridge cores.
  • The system is regularly accessed by over 1000 users.
  • PBS Professional manages workload for a broad range of scientific application areas and packages.
