TY - GEN
T1 - Locality-aware software throttling for sparse matrix operation on GPUs
AU - Chen, Yanhao
AU - Hayes, Ari B.
AU - Zhang, Chi
AU - Zhang, Eddy Z.
AU - Salmon, Timothy
PY - 2020/1/1
Y1 - 2020/1/1
N2 - This paper tackles the cache thrashing problem caused by the non-deterministic scheduling feature of bulk synchronous parallel (BSP) execution in GPUs. In the BSP model, threads can be executed and interleaved in any order before reaching a barrier synchronization point, which requires the entire working set to be in cache for maximum data reuse over time. However, it is not always possible to fit all the data in cache at once. Thus, we propose a locality-aware software throttling framework that throttles the number of active execution tasks, prevents cache thrashing, and enhances data reuse over time. Our locality-aware software throttling framework focuses on an important class of applications that operate on sparse matrices (graphs). These applications come from the domains of linear algebra, graph processing, machine learning and scientific simulation. Evaluated on over 200 real sparse matrices and graphs that suffer from cache thrashing in the Florida sparse matrix collection, our technique achieves an average of 2.01X speedup, a maximum of 6.45X speedup, and a maximum performance loss ≤5%.
AB - This paper tackles the cache thrashing problem caused by the non-deterministic scheduling feature of bulk synchronous parallel (BSP) execution in GPUs. In the BSP model, threads can be executed and interleaved in any order before reaching a barrier synchronization point, which requires the entire working set to be in cache for maximum data reuse over time. However, it is not always possible to fit all the data in cache at once. Thus, we propose a locality-aware software throttling framework that throttles the number of active execution tasks, prevents cache thrashing, and enhances data reuse over time. Our locality-aware software throttling framework focuses on an important class of applications that operate on sparse matrices (graphs). These applications come from the domains of linear algebra, graph processing, machine learning and scientific simulation. Evaluated on over 200 real sparse matrices and graphs that suffer from cache thrashing in the Florida sparse matrix collection, our technique achieves an average of 2.01X speedup, a maximum of 6.45X speedup, and a maximum performance loss ≤5%.
UR - http://www.scopus.com/inward/record.url?scp=85071071258&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071071258&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
SP - 413
EP - 425
BT - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
PB - USENIX Association
T2 - 2018 USENIX Annual Technical Conference, USENIX ATC 2018
Y2 - 11 July 2018 through 13 July 2018
ER -