Hadoop on HPC: Integrating Hadoop and Pilot-Based Dynamic Resource Management

Andre Luckow, Ioannis Paraskevakos, George Chantzialexiou, Shantenu Jha

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

High-performance distributed computing environments have traditionally been designed to meet the compute demands of scientific applications; supercomputers have historically been producers rather than consumers of data. The Apache Hadoop ecosystem has evolved to address many of the traditional limitations of HPC platforms. There exists a whole class of scientific applications that need the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of bio-molecular dynamics, genomics and high-energy physics need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present both capabilities to such scientific applications. While this question needs answers at multiple levels, we focus on the design of middleware that might support the needs of both. We propose extensions to the Pilot-Abstraction so as to provide a unifying resource management layer. This provides an important step towards integration and thereby interoperable use of HPC and Hadoop/Spark, and allows applications to efficiently couple HPC stages (e.g. simulations) to data analytics. Many supercomputing centers have started to officially support Hadoop environments, either in a dedicated environment or in hybrid deployments using tools such as myHadoop. However, this typically involves many intrinsic, environment-specific details that need to be mastered, and often swamps conceptual questions like: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data locality vs. data movement)? This paper provides both conceptual understanding and practical solutions to questions central to the integrated use of HPC and Hadoop environments. Our experiments are performed on state-of-the-art production HPC environments and provide middleware for multiple domain sciences.
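The core idea of the Pilot-Abstraction described above can be illustrated with a toy sketch: a pilot acquires a block of HPC resources once, and heterogeneous tasks (an HPC-style simulation stage, a Hadoop/Spark-style analysis stage) are then scheduled inside that single allocation instead of waiting in the batch queue separately. All names below (`Pilot`, `Task`, `submit`) are illustrative assumptions for this sketch, not the actual Pilot-Abstraction or RADICAL-Pilot API.

```python
# Toy model of pilot-based resource management: one allocation,
# many heterogeneous tasks placed inside it. Names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kind: str          # "hpc" (e.g. MD simulation) or "hadoop" (e.g. Spark job)
    cores: int

@dataclass
class Pilot:
    total_cores: int                       # size of the acquired allocation
    free_cores: int = field(init=False)
    scheduled: list = field(default_factory=list)

    def __post_init__(self):
        self.free_cores = self.total_cores

    def submit(self, task: Task) -> bool:
        """Place a task inside the pilot's allocation if enough cores remain."""
        if task.cores <= self.free_cores:
            self.free_cores -= task.cores
            self.scheduled.append(task)
            return True
        return False

# One allocation serves both stages: the simulation and the
# Spark analysis of its output share the pilot, avoiding a
# second batch-queue wait and enabling data locality.
pilot = Pilot(total_cores=64)
pilot.submit(Task("md-simulation", "hpc", cores=48))
pilot.submit(Task("spark-analysis", "hadoop", cores=16))
```

The design point the sketch captures is that resource acquisition (the pilot) is decoupled from task scheduling, so HPC and Hadoop stages can interoperate within one resource context.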

Original language: English (US)
Title of host publication: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1607-1616
Number of pages: 10
ISBN (Electronic): 9781509021406
DOIs
State: Published - Jul 18 2016
Event: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
Duration: May 23 2016 - May 27 2016

Publication series

Name: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Other

Other: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Country/Territory: United States
City: Chicago
Period: 5/23/16 - 5/27/16

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications

Keywords

  • Big Data
  • HPC
  • Hadoop
