TY - JOUR
T1 - Fast parallel sorting under LogP
T2 - Experience with the CM-5
AU - Dusseau, Andrea C.
AU - Culler, David E.
AU - Schauser, Klaus Erik
AU - Martin, Richard P.
N1 - Funding Information:
The authors wish to acknowledge the computational support provided by the National Science Foundation Infrastructure Grant number CDA-8722788 and the Advanced Computing Laboratory of Los Alamos National Laboratory. Andrea Dusseau was supported by a National Science Foundation Graduate Research Fellowship. David Culler was supported by a National Science Foundation Presidential Faculty Fellowship CCR-9253705 and LLNL Grant UCB-ERL-92/172. Klaus Schauser was supported by an IBM Graduate Fellowship and received research support from the Department of Computer Science at the University of California, Berkeley. Richard Martin was supported by a University of California, Berkeley Fellowship.
PY - 1996
Y1 - 1996
N2 - In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
AB - In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
KW - Algorithm implementation
KW - Communication performance
KW - Communication schedules
KW - Data layout
KW - Massively parallel processing
KW - Models of parallel computation
KW - Performance analysis
KW - Performance prediction
KW - Sorting
UR - http://www.scopus.com/inward/record.url?scp=0030216116&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0030216116&partnerID=8YFLogxK
U2 - 10.1109/71.532111
DO - 10.1109/71.532111
M3 - Article
AN - SCOPUS:0030216116
SN - 1045-9219
VL - 7
SP - 791
EP - 805
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 8
ER -