Collaborative Research: Frameworks: Designing Next-Generation MPI Libraries for Emerging Dense GPU Systems

Project Details

Description

The extremely high compute and communication capabilities offered by modern Graphics Processing Units (GPUs) and high-performance interconnects have led to the creation of High-Performance Computing (HPC) platforms with multiple GPUs and high-performance interconnects per node. Unfortunately, state-of-the-art production quality implementations of the popular Message Passing Interface (MPI) programming model do not have the appropriate support to deliver the best performance and scalability for applications on such dense GPU systems. These developments in High-End Computing (HEC) technologies and associated middleware issues lead to the following broad challenge: How can existing production quality MPI middleware be enhanced to take advantage of emerging networking technologies to deliver the best possible scale-up and scale-out for HPC and Deep Learning (DL) applications on emerging dense GPU systems? A synergistic and comprehensive research plan, involving computer scientists from The Ohio State University (OSU) and Ohio Supercomputer Center (OSC) and computational scientists from the Texas Advanced Computing Center (TACC), and San Diego Supercomputer Center (SDSC) and University of California San Diego (UCSD), is proposed to address the above broad challenges with innovative solutions. The proposed framework will be made available to collaborators and the broader scientific community to understand the impact of the proposed innovations on next-generation HPC and DL frameworks and applications in various science domains. Multiple graduate and undergraduate students will be trained under this project as future scientists and engineers in HPC. The proposed work will enable curriculum advancements via research in pedagogy for key courses in the new Data Science programs at OSU, SDSC and TACC. The established national-scale training and outreach programs at TACC, SDSC and OSC will be used to disseminate the results of this research to XSEDE users. Tutorials and workshops will be organized at PEARC, SC and other conferences to share the research results and experience with the community. The project is aligned with the National Strategic Computing Initiative (NSCI) to advance US leadership in HPC and the recent initiative of the US Government to maintain leadership in Artificial Intelligence (AI.)

The proposed innovations include: 1) Designing high-performance and scalable point-to-point, and collective communication operations that fully utilize multiple network adapters and advanced in-network computing features for GPU and CPU buffers within and across nodes; 2) Designing novel datatype processing and unified memory management to improve application performance; 3) Designing CUDA-aware I/O subsystem to accelerate MPI I/O and checkpoint-restart for HPC and DL applications; 4) Designing support for containerized environments to better enable easy deployment of proposed solutions on modern cloud environments; and 5) Carry out integrated development and evaluation to ensure proper integration of proposed designs with the driving applications. The proposed designs will be integrated into the widely-used MVAPICH2 library and made available. The project team members will work closely with internal and external collaborators to facilitate wide deployment and adoption of released software. The proposed solutions will be targeted to enable scale-up and scale-out of the driving science domains (molecular dynamics, lattice QCD, seismology, image classification, and fusion research) on emerging dense GPU platforms. The transformative impact of the proposed development effort is to achieve scalability, performance, and portability out of HPC and DL frameworks and applications to take advantage of emerging dense GPU platforms and hence, leading to significant advancements in science and engineering.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

StatusFinished
Effective start/end date9/25/1710/31/23

Funding

  • National Science Foundation: $383,161.00

Fingerprint

Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.