DCAS functionality can be implemented in software using a technique introduced by Bershad [4]. DCAS is implemented using a lock known to the operating system. If a process holding this locks is delayed by a context switch, the operating system rolls back the process out of the DCAS procedure and releases the lock. The rollback procedure is relatively simple because the DCAS implementation is simple and known to the operating system. Moreover, the probability of a context switch in the middle of the DCAS procedure is low because it is so short, typically a few instructions. Thus, the rollback cost is incurred infrequently.
This technique can be used more generally to implement other primitives such as n-location CAS. We focus on DCAS implementation because the primary relation to our work is offering a software implementation of DCAS as an alternative to our proposed hardware support. It also seems simpler to just implement rollback for DCAS compared to more general primitives.
This approach has the key advantage of not requiring hardware extensions over the facilities in existing systems. Moreover, its performance may be comparable to our hardware extensions, especially on single processors or small-scale multiprocessors. Further measurements are required here. However, there are a few concerns. First, there is the cost of locking. The straight-forward implementation requires the DCAS procedure to access a common global lock from all processes. In a multi-level memory with locks in memory, the memory contention between processors for this lock can be significant. For example, the data structure may be in a shared segment that is mapped in by two independent processes. If the locks are associated with each DCAS instance, there is more cost and complexity to designate the locks and critical section to the operating system and to implement the rollback. The locking and unlocking also modifies the cache line containing the lock, further increasing the cost of this operation because writeback is required.
Second, Bershad's approach requires rereading the two locations from memory as well as an extra read and write to set the lock and write to clear the lock.
Third, on multiprocessors, care must be used by readers of shared data structures if they want to support unsynchronized reads. Without depending on the lock, readers can see intermediate states of the DCAS, and read tentative values that are part of a DCAS that fails. Requiring synchronization for reads significantly increases contention on the global lock. Note that in many cases TSM reduces the danger of unsynchronized reads because the reads cannot cause type errors. Writes are protected by the global lock, and the final DCAS will detect that the unsynchronized reads were suspect, and fail. Systems that provide hardware DCAS require no additional read synchronization beyond that performed automatically by the memory system. Further experience and measurements are required to determine whether this is a significant issue on real systems.
Finally, the Bershad mechanism seems harder to test under all conditions. For instance, it is possible that one of the write operations that the rollback needs to undo is to an area of memory that has been paged out or that one of the addresses is illegal. The system also needs to ensure that a thread is rolled back out of any DCAS critical section if it is terminated. We believe our hardware implementation is simpler to verify and naturally operates on top of the virtual memory management of the system and on top of directly accessible physical memory at the lowest level of the system software. It is of concern that a minor change to the software mechanisms in Bershad's scheme could result in very subtle errors in execution that could go undetected in a system for a long period of time.