Cascaded execution: Speeding up unparallelized execution on shared-memory multiprocessors

Ruth E. Anderson, Thu D. Nguyen, John Zahorjan

Research output: Contribution to journal › Conference article

2 Citations (Scopus)

Abstract

Both inherently sequential code and limitations of analysis techniques prevent full parallelization of many applications by parallelizing compilers. Amdahl's Law tells us that as parallelization becomes increasingly effective, any unparallelized loop becomes an increasingly dominant performance bottleneck. We present a technique for speeding up the execution of unparallelized loops by cascading their sequential execution across multiple processors: only a single processor executes the loop body at any one time, and each processor executes only a portion of the loop body before passing control to another. Cascaded execution allows otherwise idle processors to optimize their memory state for the eventual execution of their next portion of the loop, resulting in significantly reduced overall loop body execution times. We evaluate cascaded execution using loop nests from wave5, a Spec95fp benchmark application, and a synthetic benchmark. Running on a PC with 4 Pentium Pro processors and an SGI Power Onyx with 8 R10000 processors, we observe overall speedups of 1.35 and 1.7, respectively, for the wave5 loops we examined, and speedups as high as 4.5 for individual loops. Our extrapolated results using the synthetic benchmark show a potential for speedups as large as 16 on future machines.
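The abstract describes the mechanism only at a high level. The sketch below is an illustrative reconstruction in C with POSIX threads, not the authors' implementation: the loop body, the chunking policy, and the use of __builtin_prefetch as the "memory state optimization" are assumptions made for this example. It shows the two ingredients the abstract names: only the processor holding the turn token executes loop iterations at any one time, and processors that are still waiting warm their caches for their upcoming portion of the loop.

/*
 * Minimal sketch of cascaded execution (illustrative assumptions only).
 * Build: cc -O2 -pthread cascade.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N        (1 << 20)            /* sequential loop trip count (assumed) */
#define NTHREADS 4                    /* processors participating in the cascade */
#define CHUNK    (N / NTHREADS)

static double a[N], b[N];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  turn = PTHREAD_COND_INITIALIZER;
static int token = 0;                 /* whose chunk of the loop runs next */

static void *worker(void *arg)
{
    int id    = (int)(intptr_t)arg;
    int start = id * CHUNK;
    int end   = (id == NTHREADS - 1) ? N : start + CHUNK;

    /* While otherwise idle, optimize this processor's memory state for its
     * upcoming portion of the loop; prefetching is one plausible realization. */
    for (int i = start; i < end; i += 8)
        __builtin_prefetch(&a[i]);

    /* Wait for our turn: only one processor executes the loop body at a time,
     * so the loop's sequential semantics are preserved. */
    pthread_mutex_lock(&lock);
    while (token != id)
        pthread_cond_wait(&turn, &lock);
    pthread_mutex_unlock(&lock);

    /* Execute our portion of the unparallelized loop (a true dependence on
     * b[i-1] makes it sequential). */
    for (int i = start; i < end; i++)
        b[i] = (i > 0 ? b[i - 1] : 0.0) + a[i];

    /* Pass control to the next processor in the cascade. */
    pthread_mutex_lock(&lock);
    token++;
    pthread_cond_broadcast(&turn);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}

Because control passes strictly in chunk order, the dependence on b[i-1] is satisfied exactly as in single-processor execution; any speedup comes only from each processor arriving at its turn with a warmer cache.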

Original language: English (US)
Pages (from-to): 714-719
Number of pages: 6
Journal: Proceedings of the International Parallel Processing Symposium, IPPS
ISSN: 1063-7133
State: Published - Jan 1 1999
Event: Proceedings of the 1999 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing - San Juan
Duration: Apr 12 1999 - Apr 16 1999


All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
