We present a study of the architectural requirements and scalability of the NAS Parallel Benchmarks. Through direct measurements and simulations, we identify the factors that affect the scalability of the benchmark codes on two relevant and distinct platforms: a cluster of workstations and a ccNUMA SGI Origin 2000. We find that the benefit of increased global cache size is pronounced in certain applications and often offsets the communication cost. By constructing the working set profile of the benchmarks, we are able to visualize the improvement in computational efficiency under constant-problem-size scaling. We also find that, while the Origin MPI has better point-to-point performance, the cluster MPI layer scales better with communication load. However, communication performance within the applications is often much lower than that achieved by micro-benchmarks. We show that the communication protocols used by the MPI runtime library strongly influence communication performance in applications, and that the benchmark codes have a wide spectrum of communication requirements.