Callisto-RTS: Fine-Grain Parallel Loops
Tim Harris, Oracle Labs; Stefan Kaestle, ETH Zürich
We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops—down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machines we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Tim Harris and Stefan Kaestle},
title = {{Callisto-RTS}: {Fine-Grain} Parallel Loops},
booktitle = {2015 USENIX Annual Technical Conference (USENIX ATC 15)},
year = {2015},
isbn = {978-1-931971-225},
address = {Santa Clara, CA},
pages = {45--56},
url = {https://www.usenix.org/conference/atc15/technical-session/presentation/harris},
publisher = {USENIX Association},
month = jul
}
connect with us