
Improving HPC System Throughput and Response Time using Memory Disaggregation

Bibliographic Details
Main Authors:	Zacarias, Felippe Vieira; Carpenter, Paul; Petrucci, Vinicius
Format:	Conference Proceeding
Language:	English
Subjects:	Computational modeling; Conferences; Disaggregation; Memory management; Performance degradation; Performance prediction; Resource management; Resource scheduling; Slurm; Throughput; Time factors

Description
Summary:	HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned to accommodate normal and large capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memories and it introduces a new disaggregated-aware job allocation policy for the Slurm resource manager. Our results show that using disaggregated memories, depending on the imbalance between the system and the submitted jobs, a similar throughput and job response time can be achieved on a system with up to 33% less total memory provisioning.
ISSN:	2690-5965
DOI:	10.1109/ICPADS53394.2021.00041