Our experience suggests that there is a powerful synergy between non-blocking synchronization and several good structuring techniques for the design and implementation of an operating system and supporting run-time libraries. Non-blocking synchronization significantly reduces the complexity and improves the performance of software in the signal-rich environment implemented by the Cache Kernel and supporting class libraries. Moreover, the structuring techniques we have used to achieve our overall system design goals facilitate implementing non-blocking synchronization. The biggest problem has been inadequate performance of the non-blocking synchronization instructions.
This work makes several contributions. First, we show that careful design and implementation of operating system software for efficiency, reliability and modularity makes implementing simple, efficient non-blocking synchronization far easier. In particular, type-stable memory (TSM), contention-minimizing data structuring and minimal inconsistency window structuring are important for all these reasons. These techniques are beneficial even with blocking synchronization and yet significantly reduce the complexity and improve the performance of non-blocking synchronization. Conversely, non-blocking synchronization has significant advantages in the signal-centric design of the Cache Kernel and its associated libraries, especially with the large amount of conventional operating system functionality that is implemented at the library, rather than kernel, level.
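As an illustration of type-stable memory (the allocator names and the tcb_t type here are hypothetical, not taken from our sources), a per-type pool can recycle descriptors through a free list so that a given address always refers to the same type of descriptor:

    /* Sketch of a type-stable allocator for one descriptor type.  Freed
     * descriptors return to a per-type free list and are only ever
     * reallocated as the same type, so a thread holding a stale pointer
     * still sees a structurally valid (if logically free) descriptor.
     * Synchronization of the free list itself is omitted; in practice it
     * would use the non-blocking techniques described below. */
    #include <stddef.h>

    typedef struct tcb {
        struct tcb *next_free;    /* link on the per-type free list */
        int         in_use;       /* logically allocated flag       */
        /* ... type-specific fields ... */
    } tcb_t;

    #define TCB_POOL_SIZE 256
    static tcb_t  tcb_pool[TCB_POOL_SIZE];  /* memory is permanently tcb_t */
    static tcb_t *tcb_free_list;

    void tcb_pool_init(void) {
        for (size_t i = 0; i + 1 < TCB_POOL_SIZE; i++)
            tcb_pool[i].next_free = &tcb_pool[i + 1];
        tcb_pool[TCB_POOL_SIZE - 1].next_free = NULL;
        tcb_free_list = &tcb_pool[0];
    }

    tcb_t *tcb_alloc(void) {
        tcb_t *t = tcb_free_list;
        if (t != NULL) {
            tcb_free_list = t->next_free;
            t->in_use = 1;
        }
        return t;
    }

    void tcb_free(tcb_t *t) {
        t->in_use    = 0;                  /* stays a tcb_t forever */
        t->next_free = tcb_free_list;
        tcb_free_list = t;
    }

Because the pool is never returned to a general-purpose allocator, optimistic readers can dereference a descriptor without first checking whether it has been freed.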
Second, we describe a number of techniques for implementing non-blocking synchronization using TSM, version numbers and DCAS. These techniques are simple to write, read, and understand, and they perform well, in contrast to the complexity and cost of comparable CAS-based techniques. Our experience suggests that good DCAS support is sufficient for a practical non-blocking OS and run-time system implementation, and that single CAS is not. In fact, the lack of efficient DCAS support in current systems is a potential impediment to using our techniques.
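A minimal sketch of the version-number-and-DCAS pattern follows; the dcas() binding and the list_t, elem_t, and list_pop names are illustrative rather than code from our implementation. The version word is advanced on every successful update, so any concurrent modification between the optimistic reads and the DCAS causes the DCAS to fail and the operation to retry, while TSM makes the intermediate dereference safe:

    /* Hypothetical C binding for DCAS: atomically, if *a1 == o1 and
     * *a2 == o2, store n1 and n2 and return nonzero; otherwise return 0.
     * Word-sized operands only. */
    extern int dcas(volatile unsigned long *a1, volatile unsigned long *a2,
                    unsigned long o1, unsigned long o2,
                    unsigned long n1, unsigned long n2);

    typedef struct elem { struct elem *next; /* ... payload ... */ } elem_t;

    typedef struct {
        volatile unsigned long version;   /* bumped on every successful update */
        elem_t *volatile head;
    } list_t;

    /* Remove and return the first element, or NULL if the list is empty. */
    elem_t *list_pop(list_t *l) {
        for (;;) {
            unsigned long v = l->version;
            elem_t *first = l->head;
            if (first == NULL)
                return NULL;
            /* Safe even if 'first' is concurrently deleted: TSM keeps the
             * memory an elem_t, and the DCAS below detects the race. */
            elem_t *next = first->next;
            if (dcas(&l->version, (volatile unsigned long *)&l->head,
                     v, (unsigned long)first,
                     v + 1, (unsigned long)next))
                return first;
            /* Version or head changed: another update won; retry. */
        }
    }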
Fortunately, our proposed hardware implementation indicates that it is feasible to implement efficient DCAS functionality in a modern processor with minimal additional complexity and full compatibility with the load-store architecture. The conditional load capability, coupled with cache-based advisory locking, further improves the hardware support, providing the advantages of locking within a lock-free implementation. The existence of software implementations of DCAS and of contention-reduction techniques demonstrates that our approach is reasonable even on platforms lacking hardware support.
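For concreteness, the dcas() interface sketched above can also be provided in software where no hardware support exists. The fallback below serializes all DCAS operations through a single spinlock; it preserves atomicity and interface compatibility but not the non-blocking property, and it is only an illustration, not the software implementations referred to in the text:

    /* Illustrative software fallback for dcas(); C11 atomics are assumed. */
    #include <stdatomic.h>

    static atomic_flag dcas_lock = ATOMIC_FLAG_INIT;

    int dcas(volatile unsigned long *a1, volatile unsigned long *a2,
             unsigned long o1, unsigned long o2,
             unsigned long n1, unsigned long n2)
    {
        int ok = 0;
        while (atomic_flag_test_and_set_explicit(&dcas_lock,
                                                 memory_order_acquire))
            ;                                    /* spin until acquired */
        if (*a1 == o1 && *a2 == o2) {
            *a1 = n1;
            *a2 = n2;
            ok = 1;
        }
        atomic_flag_clear_explicit(&dcas_lock, memory_order_release);
        return ok;
    }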
Efficiently supported DCAS would allow fully-synchronized standard libraries and operating system software to be portable across multiprocessors and uniprocessors without extra overhead or code complication. It would allow parallel architectures to use software developed for uniprocessors, relying on the (non-blocking) synchronization required for signals to handle serialization in the parallel processing context. This would significantly reduce the software bottleneck that has slowed the deployment of parallel processing to date.
Further work is required to evaluate the merits of hardware support for DCAS versus various software alternatives, particularly for overall system performance. Further work is also required to validate our experience that DCAS is in fact adequate in practice. However, our experience to date convinces us that the non-blocking approach is an attractive and practical way to structure operating system software. Locks will become more problematic as signals are used more extensively in libraries, as synchronization becomes finer grained, and as the costs of memory delays and descheduling become even higher relative to processor speed. We hope our work encourages additional efforts in this area.