TY - GEN
T1 - Critical points based register-concurrency autotuning for GPUs
AU - Li, Ang
AU - Song, Shuaiwen Leon
AU - Kumar, Akash
AU - Zhang, Eddy Z.
AU - Chavarría-Miranda, Daniel
AU - Corporaal, Henk
N1 - Publisher Copyright:
© 2016 EDAA.
PY - 2016/4/25
Y1 - 2016/4/25
N2 - The unprecedented prevalence of GPGPU is largely attributed to its abundant on-chip register resources, which allow massively concurrent threads and extremely fast context switch. However, due to internal memory size constraints, there is a tradeoff between the per-thread register usage and the overall thread concurrency. This becomes a design problem in terms of performance tuning, since the performance sweet spot which can be significantly affected by these two factors is generally unknown beforehand. In this paper, we propose an effective autotuning solution to quickly and efficiently select the optimal number of registers perthread for delivering the best GPU performance. Experiments on three generations of GPUs (Nvidia Fermi, Kepler and Maxwell) demonstrate that our simple strategy can achieve an average of 10% performance improvement while a max of 50% over the original version without modifying the user code. Additionally, to reduce local cache misses due to register spilling and further improve performance, we explore three optimization schemes (i.e. bypass L1 for global memory access, enlarge local L1 cache and spill into shared memory) and discuss their impact on performance on a Kepler GPU.
AB - The unprecedented prevalence of GPGPU is largely attributed to its abundant on-chip register resources, which allow massively concurrent threads and extremely fast context switch. However, due to internal memory size constraints, there is a tradeoff between the per-thread register usage and the overall thread concurrency. This becomes a design problem in terms of performance tuning, since the performance sweet spot which can be significantly affected by these two factors is generally unknown beforehand. In this paper, we propose an effective autotuning solution to quickly and efficiently select the optimal number of registers perthread for delivering the best GPU performance. Experiments on three generations of GPUs (Nvidia Fermi, Kepler and Maxwell) demonstrate that our simple strategy can achieve an average of 10% performance improvement while a max of 50% over the original version without modifying the user code. Additionally, to reduce local cache misses due to register spilling and further improve performance, we explore three optimization schemes (i.e. bypass L1 for global memory access, enlarge local L1 cache and spill into shared memory) and discuss their impact on performance on a Kepler GPU.
UR - http://www.scopus.com/inward/record.url?scp=84973636438&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84973636438&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84973636438
T3 - Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
SP - 1273
EP - 1278
BT - Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
Y2 - 14 March 2016 through 18 March 2016
ER -