TY - GEN
T1 - RADICAL-Pilot and PMIx/PRRTE
T2 - 25th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2022, held in conjunction with the 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
AU - Titov, Mikhail
AU - Turilli, Matteo
AU - Merzky, Andre
AU - Naughton, Thomas
AU - Elwasif, Wael
AU - Jha, Shantenu
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Resource partitioning and the distribution of task execution over partitioned resources promise to address those problems, but we lack an experimental evaluation of their performance at scale. This paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, which are responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance on Summit in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads. The integrated solution with PMIx/PRRTE enables the use of an abstracted, standardized set of interfaces for orchestrating the launch process, and provides dynamic process management and monitoring capabilities. It extends scaling capabilities, overcoming a limitation of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations we explored provide insights into DVM performance and a layout to leverage it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization no lower than 52%. With fewer concurrently executing tasks, resource utilization reaches up to 85%, based on the results of a heterogeneous workload of 8,200 tasks on 256 nodes and 2 DVMs.
AB - Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Resource partitioning and the distribution of task execution over partitioned resources promise to address those problems, but we lack an experimental evaluation of their performance at scale. This paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, which are responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance on Summit in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads. The integrated solution with PMIx/PRRTE enables the use of an abstracted, standardized set of interfaces for orchestrating the launch process, and provides dynamic process management and monitoring capabilities. It extends scaling capabilities, overcoming a limitation of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations we explored provide insights into DVM performance and a layout to leverage it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization no lower than 52%. With fewer concurrently executing tasks, resource utilization reaches up to 85%, based on the results of a heterogeneous workload of 8,200 tasks on 256 nodes and 2 DVMs.
KW - High performance computing
KW - Middleware
KW - Resource management
KW - Runtime environment
KW - Runtime system
UR - http://www.scopus.com/inward/record.url?scp=85148692928&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85148692928&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-22698-4_5
DO - 10.1007/978-3-031-22698-4_5
M3 - Conference contribution
AN - SCOPUS:85148692928
SN - 9783031226977
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 88
EP - 107
BT - Job Scheduling Strategies for Parallel Processing - 25th International Workshop, JSSPP 2022, Revised Selected Papers
A2 - Klusáček, Dalibor
A2 - Corbalán, Julita
A2 - Rodrigo, Gonzalo P.
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 3 June 2022 through 3 June 2022
ER -