NUMACROS: Data Parallel Programming on NUMA Multiprocessors
Hui Li and Kenneth C. Sevcik
Computer Systems Research Institute
University of Toronto
Toronto, CANADA
Abstract
Data parallel programming has been widely used in developing
scientific applications on various types of parallel machines: SIMD,
MIMD distributed memory machines, and UMA shared memory machines.
On NUMA shared memory machines, data locality is the key to good
performance of parallel applications. In this paper, we propose a set
of macros (NUMACROS) for data parallel programming on NUMA machines.
NUMACROS attempts to achieve both ease of programming and data
locality for performance. Programs written using NUMACROS are nearly
as short and easily readable as sequential versions of the programs.
To obtain data locality, data and loops are distributed and
partitioned in a coordinated fashion among the processors. Although
global address spaces facilitate data distribution on NUMA systems, a
naive implementation of an application will suffer from high costs. To
reduce these costs, we propose and evaluate a number of approaches,
including index precomputing, index checking, and loop transformation,
among others. Our experimental results on the Hector multiprocessor
show that these approaches are effective. While such
facilities will be provided by compilers in the long run, NUMACROS is
a helpful interim step.