TY - GEN
T1 - Architectural support for address translation on GPUs
T2 - 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2014
AU - Pichai, Bharath
AU - Hsu, Lisa
AU - Bhattacharjee, Abhishek
PY - 2014
Y1 - 2014
N2 - The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory. To this end, we explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) in unified heterogeneous systems. We show the challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1-parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads in the range deemed acceptable for CPUs (10-15% of runtime). We presume this initial design leaves room for improvement but anticipate the bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this area.
KW - GPUs
KW - MMUs
KW - TLBs
KW - Unified address space
UR - http://www.scopus.com/inward/record.url?scp=84897759661&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84897759661&partnerID=8YFLogxK
U2 - 10.1145/2541940.2541942
DO - 10.1145/2541940.2541942
M3 - Conference contribution
AN - SCOPUS:84897759661
SN - 9781450323055
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 743
EP - 757
BT - ASPLOS 2014 - 19th International Conference on Architectural Support for Programming Languages and Operating Systems
Y2 - 1 March 2014 through 5 March 2014
ER -