Straggler mitigation at scale

Mehmet Fatih Aktas, Emina Soljanin

Research output: Contribution to journal › Article

2 Scopus citations

Abstract

Runtime performance variability has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure-coded redundancy. The tail heaviness of service time variability is decisive for the pain and gain of redundancy, and we quantify its effect by deriving expressions for cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded redundancy compare in the cost vs. latency tradeoff? 2) Can we introduce redundancy after waiting some time and expect it to reduce the cost? 3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? We validate the answers we found for each of these questions via simulations that use empirical distributions extracted from Google cluster data.
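The cost vs. latency comparison raised in question 1 can be explored numerically with a small Monte Carlo sketch. The block below is an illustration only, not the paper's derivation or its simulator. It assumes Pareto-distributed task service times as a stand-in for heavy-tailed variability, assumes that outstanding copies are cancelled as soon as they are no longer needed, and measures cost as total server time. All function and parameter names (`pareto`, `replicated_job`, `coded_job`, `simulate`, `scale`, `alpha`) are hypothetical.

```python
import random


def pareto(rng, scale, alpha):
    """Sample a Pareto(scale, alpha) service time by inverse-CDF sampling."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return scale * u ** (-1.0 / alpha)


def replicated_job(rng, k, c, scale, alpha):
    """Job of k tasks, each launched with c extra replicas.

    Assumption: a task completes when its fastest copy completes, and the
    remaining copies of that task are cancelled at that moment.
    Returns (latency, cost), where cost is total server time.
    """
    latency = 0.0
    cost = 0.0
    for _ in range(k):
        fastest = min(pareto(rng, scale, alpha) for _ in range(c + 1))
        latency = max(latency, fastest)   # job waits for its slowest task
        cost += (c + 1) * fastest         # every copy ran until the fastest finished
    return latency, cost


def coded_job(rng, k, n, scale, alpha):
    """Job of k tasks expanded into n >= k coded tasks (MDS-style).

    Assumption: the job completes when any k of the n tasks complete, and
    the n - k outstanding tasks are cancelled at that moment.
    Returns (latency, cost), where cost is total server time.
    """
    times = sorted(pareto(rng, scale, alpha) for _ in range(n))
    latency = times[k - 1]                         # k-th fastest completion
    cost = sum(times[:k]) + (n - k) * latency      # finished work + cancelled work
    return latency, cost


def simulate(job, trials, seed, **params):
    """Average latency and cost of `job` over independent trials."""
    rng = random.Random(seed)
    total_latency = 0.0
    total_cost = 0.0
    for _ in range(trials):
        latency, cost = job(rng, **params)
        total_latency += latency
        total_cost += cost
    return total_latency / trials, total_cost / trials


if __name__ == "__main__":
    # Both schemes below launch 20 task instances for a 10-task job.
    rep = simulate(replicated_job, 2000, 0, k=10, c=1, scale=1.0, alpha=2.0)
    cod = simulate(coded_job, 2000, 0, k=10, n=20, scale=1.0, alpha=2.0)
    print("replication (latency, cost):", rep)
    print("coding      (latency, cost):", cod)
```

Varying `alpha` changes the tail heaviness of the assumed service-time distribution (smaller values give heavier tails), which is one way to see how the tail affects the tradeoff under these simplified assumptions.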

Original language: English (US)
Article number: 8884664
Pages (from-to): 2266-2279
Number of pages: 14
Journal: IEEE/ACM Transactions on Networking
Volume: 27
Issue number: 6
State: Published - Dec 2019

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Electrical and Electronic Engineering

Keywords

  • Coded and replicated redundancy
  • cost vs latency tradeoff in distributed computing
  • straggler relaunch
