Recent studies have shown the potential of last-level TLBs shared by multiple cores in tackling memory translation performance challenges posed by 'big data' workloads. A key stumbling block hindering their effectiveness, however, is their high access time. We present a design methodology to reduce these high access times so as to realize high-performance and scalable shared L2 TLBs. As a first step, we study the benefits of replacing monolithic shared TLBs with a distributed set of small TLB slices. While this approach does reduce TLB lookup latency, it increases interconnect delays in accessing remote slices. Therefore, as a second step, we devise a lightweight single-cycle interconnect among the TLB slices by tailoring wires and switches to the unique communication characteristics of memory translation requests and responses. Our approach, which we dub Nocstar (NOCs for scalable TLB architecture), combines the high hit rates of shared TLBs with low access times of private L2 TLBs, enabling significant system performance benefits.