TY - GEN
T1 - Efficient runtime environment for coupled multi-physics simulations
T2 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing, CCGrid 2010
AU - Ko, Soon Heum
AU - Kim, Nayong
AU - Kim, Joohyun
AU - Thota, Abhinav
AU - Jha, Shantenu
PY - 2010
Y1 - 2010
N2 - Coupled multi-physics simulations, such as hybrid CFD-MD simulations, represent an increasingly important class of scientific applications. The physical problems of interest often demand high-end computers, such as TeraGrid resources, which are typically accessible only via batch queues. Batch-queue systems are not designed to natively support the coordinated scheduling of jobs, which is needed for the concurrent execution that coupled multi-physics simulations require. In this paper we develop and demonstrate a novel approach to overcome the lack of native support for the coordinated job submission that coupled runs require. We establish the performance advantages arising from our solution, which is a generalization of the Pilot-Job concept; the concept in and of itself is not new, but it is being applied to coupled simulations for the first time. Our solution not only overcomes the initial co-scheduling problem, but also provides a dynamic resource allocation mechanism. Support for such dynamic resources is critical for a load-balancing mechanism, which we develop and demonstrate to be effective at reducing the total time-to-solution of the problem. We establish that the performance advantage of using BigJobs is invariant with the size of the machine as well as the size of the physical model under investigation. The Pilot-Job abstraction is developed using SAGA, which provides an infrastructure-agnostic implementation that can seamlessly execute on and utilize distributed resources.
UR - http://www.scopus.com/inward/record.url?scp=77954940474&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954940474&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2010.107
DO - 10.1109/CCGRID.2010.107
M3 - Conference contribution
AN - SCOPUS:77954940474
SN - 9781424469871
T3 - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
SP - 349
EP - 358
BT - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
Y2 - 17 May 2010 through 20 May 2010
ER -