Next: Software Implementation of Up: The Synergy Between Non-blocking Previous: Comparison to Blocking

Non-blocking Synchronization Primitives

Our approach assumes an efficient implementation of DCAS functionality. In this section, we briefly outline an instruction set extension to the load-linked/ store-conditional) instructions to support DCAS. (A software implementation is discussed in Section 6.1.) With a processor supporting load-linked (LL) and store-conditional (SC) instructions, add two instructions:

LLP (load-linked-pipelined): load and link to a second address after a LL. This load is linked to the following SCP.
SCP (store-conditional-pipelined): Store to the specified location provided that no modifications have been made to either of the memory cells designated by either of the most recent LL and LLP instructions and these cache lines have not been invalidated in the cache of the processor performing the SCP.

If a LLP/ SCP sequence nested within an LL/SC pair fails, the outer LL/SC pair fails too.

DCAS is then implemented by the instruction sequence shown in Figure 3 (using R4000 instructions in addition to the LL/SC(P) instructions).

Figure 3: DCAS Implementation using LL/ SC and LLP/ SCP . Success or failure of SC (and thus of the DCAS operation) is returned in U1 or whatever general register holds the argument to SC. 1 denotes success, 0 failure. If the next instruction tries to read U1, the hardware interlocks (as it already does for LL/SC) if the result of SC is not already in U1.

The LL and LLP instructions in lines 1 and 2 ``link'' the loads with the respective stores issued by the following SC and SCP instructions. Lines 3 and 4 verify that (T0) and (T1) contain V0 and V1, respectively. The SCP and SC in lines 5 and 6 are conditional. They will not issue the stores unless (T0) and (T1) have been unchanged since lines 1 and 2. This guarantees that the results of CAS in lines 3 and 4 are still valid at line 6, or else the SC fails. Further, the store issued by a successful SCP is buffered pending a successful SC. Thus, SC in line 6 writes U1 and U0 to (T1) and (T0) atomically with the comparison to V0 and V1.

We have worked out a detailed design for the implementation of these two instructions in a RISC processor such as the R4000 but the description is omitted for brevity.

Next: Software Implementation of Up: The Synergy Between Non-blocking Previous: Comparison to Blocking

Michael Greenwald
Wed Sep 18 15:42:13 PDT 1996