CIS 307: TOPIC: Machine Instructions for Concurrency. Spinlocks

These topics are not treated in Tanenbaum (only TestAndSet is)

[TestAndSet], [Swap and Fetch_and_Store], [Compare and Swap], [Non-Blocking Concurrent Queue], [Yield], [Spinlocks]

We consider some machine instructions that support concurrency.
TestAndSet and Swap are used (one or the other) for implementing spinlocks when more than one processor is used (hence interrupt masking is not sufficient). These two instructions, and all other instructions used for implementing spinlocks, do some reads and some writes to memory in a single atomic action.
Yield is used to support simple forms of context switching.
Always be conscious of the hardware support for Operating Systems functions. There are a number of reasons for this support: it may be the only way to solve a problem, and it always allows solutions that are more efficient than purely software solutions.


TestAndSet

We can think of the TestAndSet instruction as a function implemented atomically in hardware. Here is this function:

    int TestAndSet(int *x) {
        int temp = *x;
        *x = 1;
        return temp;
    }

Swap and Fetch_and_Store

The Swap instruction can also be thought of as a procedure implemented atomically in hardware. Here is this procedure:

    procedure Swap (a : in out integer; b : in out integer);
        Local Variable temp : integer;
    begin
        temp := a;
        a := b;
        b := temp;
    end;

It is easy to see that Swap can be used to implement TestAndSet. [Just store 1 in b, swap, and return b.] Swap can be implemented in terms of TestAndSet but it is more difficult [you need to create a critical region].

A similar atomic machine instruction is fetch_and_store:

    function Fetch_and_Store (a : in out integer; b : in integer): integer;
        Local Variable temp : integer;
    begin
        temp := a;
        a := b;
        return temp;
    end;

Compare and Swap

A more powerful atomic instruction is Compare-and-Swap. It is not available on most computers, but it can be implemented in terms of other atomic instructions. The instruction takes three arguments: the first is an address, the second and third are integer values, called respectively the old and the new value. The instruction atomically does the following:

    function Compare_and_Swap(addr: ^integer; old: integer; new: integer): boolean
        compare the value at addr to old
        if they are equal then
            set the value at addr to new and return success
        else
            return failure

This instruction has been used to create protected, non-blocking objects (for example queues, as shown in the next section), that is, objects that can be modified atomically and without forcing a process to wait [either busy-wait or non-busy-wait].

Non-Blocking Concurrent Queue

The following example of non-blocking algorithms is due to Michael and Scott. It specifies a queue that can be safely used by concurrent processes without the need for any spinlock or semaphore. In the following, CAS stands for CompareAndSwap.

    structure pointer_t {ptr: pointer to node_t, count: unsigned integer}
    structure node_t   {value: data type, next: pointer_t}
    structure queue_t  {Head: pointer_t, Tail: pointer_t}

    initialize(Q: pointer to queue_t)
        node = new node()               // Allocate a free node
        node->next.ptr = NULL           // Make it the only node in the linked list
        Q->Head = Q->Tail = node        // Both Head and Tail point to it

    enqueue(Q: pointer to queue_t, value: data type)
        node = new node()               // Allocate a new node from the free list
        node->value = value             // Copy enqueued value into node
        node->next.ptr = NULL           // Set next pointer of node to NULL
        loop                            // Keep trying until enqueue is done
            tail = Q->Tail              // Read Tail.ptr and Tail.count together
            next = tail.ptr->next       // Read next ptr and count fields together
            if tail == Q->Tail          // Are tail and next consistent?
                if next.ptr == NULL     // Was Tail pointing to the last node?
                    if CAS(&tail.ptr->next, next, [node, next.count+1])
                                        // Try to link node at the end of the list
                        break           // Enqueue is done, exit loop
                    endif
                else
                    CAS(&Q->Tail, tail, [next.ptr, tail.count+1])
                                        // Try to swing Tail to the next node
                endif
            endif
        endloop
        CAS(&Q->Tail, tail, [node, tail.count+1])
                                        // Enqueue is done. Try to swing Tail
                                        // to the inserted node

    dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
        loop                            // Keep trying until dequeue is done
            head = Q->Head              // Read Head
            tail = Q->Tail              // Read Tail
            next = head.ptr->next       // Read Head.ptr->next
            if head == Q->Head          // Are head, tail, and next consistent?
                if head.ptr == tail.ptr // Is queue empty or Tail falling behind?
                    if next.ptr == NULL // Is queue empty?
                        return FALSE    // Queue is empty, couldn't dequeue
                    endif
                    CAS(&Q->Tail, tail, [next.ptr, tail.count+1])
                                        // Tail is falling behind. Try to advance it
                else                    // No need to deal with Tail
                    *pvalue = next.ptr->value
                                        // Read value before CAS, otherwise another
                                        // dequeue might free the next node
                    if CAS(&Q->Head, head, [next.ptr, head.count+1])
                                        // Try to swing Head to the next node
                        break           // Dequeue is done. Exit loop
                    endif
                endif
            endif
        endloop
        free(head.ptr)                  // It is safe now to free the old dummy node
        return TRUE                     // Queue was not empty, dequeue succeeded

Yield

The Yield instruction, too, can be thought of as a procedure implemented in hardware.

    procedure Yield (a : address; b : address);
    begin
        MEMORY[a] := ProgramCounter;
        ProgramCounter := MEMORY[b];
    end;

Yield is used to switch from one process to another. Let's represent with P^ the location where in a process P we save its PC at the time of the switch, and let's represent with * the location P^ of the current process P. Then the switch from process P to process Q can be effected with "Yield *, Q^".

It is easy to use Yield to implement Co-routine structures. For example, the "RESUME Q^" statement is replaced simply by "Yield *, Q^".
It is equally easy to implement a non-preemptive scheduler. Say that the location for the scheduler S is S^ and that S manages the ready queue. Then a process gives up control with "Yield *, S^", and S dispatches the selected ready process R with "Yield *, R^".

Of course we have simplified our discussion by not worrying about the hardware context of processes (not even about the content of the General Purpose Registers). But of course, since we are not in a preemptive situation, it is easy for the processes to save and restore the appropriate information when using yield.


SpinLocks

Spinlocks are abstract data types, or classes, that support the Busy-Wait solution of the Mutual Exclusion Problem.
The operations on spinlocks are InitLock, Lock, and UnLock.
If mutex is a spinlock that has been initialized using InitLock, then Critical Regions S1, S2, .., Sn can be protected from each other by preceding each region by a call to Lock and following it by a call to UnLock. So the region Si becomes: Lock(mutex); Si; UnLock(mutex);

The pseudo code for implementing SpinLocks follows:

    TYPE SpinLock IS boolean;

    PROCEDURE InitLock(L : in out SpinLock);
    BEGIN
        L := false;
    END;

    PROCEDURE Lock(L : in out SpinLock);
    BEGIN
        WHILE TestAndSet(L) DO null;
    END;

    PROCEDURE UnLock(L : in out SpinLock);
    BEGIN
        L := false;
    END;

As you can see, the implementation is quite simple and has no system overhead. It works because of the properties of the TestAndSet instruction.
Notice that this solution involves Busy Waiting.

A big problem with spinlocks is that, by repeatedly TestAndSetting a location in memory, they can use a high percentage of the memory bandwidth, thus reducing the utilization of the other processors sharing the memory. The location that is repeatedly accessed is called a Hot Spot in memory. A partial solution (in the case of cache-coherent shared memories) is to change the Lock operation as follows:

    PROCEDURE Lock(L : in out SpinLock);
    BEGIN
        WHILE TestAndSet(L) DO
            WHILE (L = 1) DO null;
        END;
    END;

This uses less of the memory bandwidth, since the TestAndSet instruction involves both a read and a write transaction on the memory, while the test (L = 1) involves only a read operation, and this read is normally carried out on a cached copy. [The inner while loop operates on the cached copy without affecting the shared memory. It assumes that the hardware supports some form of cache-coherence algorithm which invalidates the cached copy when the shared-memory copy is updated.]

Mellor-Crummey and Scott have defined a spinlock that does not require the presence of a coherent cache, yet it does not provoke a hot spot:

    type qnode = record
        next   : ^qnode           // ptr to successor in queue
        locked : Boolean          // busy-waiting necessary
    type lock = ^qnode            // ptr to tail of queue

    // I points to a queue link record allocated in shared memory,
    // locally-accessible and specific to the invoking processor.

    procedure acquire_lock(L : ^lock; I : ^qnode)
        var pred : ^qnode
        I->next := nil                    // initially no successor
        pred := fetch_and_store(L, I)     // queue for lock
        if pred != nil                    // lock was not free
            I->locked := true             // prepare to spin
            pred->next := I               // link behind predecessor
            repeat while I->locked        // busy-wait for lock

    procedure release_lock(L : ^lock; I : ^qnode)
        if I->next = nil                  // no known successor
            if compare_and_swap(L, I, nil)
                return                    // no successor, lock free
            repeat while I->next = nil    // wait for successor
        I->next->locked := false          // pass lock

There is no guarantee of Fairness for spinlocks. That is, if two processes P and Q are trying to enter a critical region protected by a locked spinlock mutex, then there is no guarantee that they will be serviced in FIFO order. Their order of entry will be random. More serious is the problem that the process P may use the spinlock within a loop and acquire the lock repeatedly while Q is unable to acquire it. [How can this happen?] This behavior could be avoided by giving a "Nice" behavior to the UnLock operation: when the UnLock is executed, a Yield instruction is used to check if another process is waiting and, if so, to defer to it. Of course, in most situations this is an unwarranted precaution.

Spinlocks also are dangerous to use because of the Priority Inversion Problem. See the discussion of this issue in Tanenbaum, page 39.
Two possible solutions to the priority inversion problem are priority inheritance, where the low-priority process holding the lock temporarily inherits the priority of the highest-priority process waiting for it, and priority ceilings, where each lock is assigned the priority of the highest-priority process that can ever acquire it.

You may see the danger of the Priority Inversion Problem in the following note about the Mars Rover.

A related problem is the formation of scheduling convoys. This occurs when a process ceases to run in the middle of a critical region [because of a page fault, quantum expiration, or external interrupt] and then all other processes sharing this region have to wait for its completion. A number of techniques have been introduced to deal with this problem, in particular Preemption-safe Locking and Non-blocking Algorithms. [Preemption-safe locking tries to avoid preemption during a critical region by having the kernel extend the quantum of a process in the critical region or by increasing its priority. Non-blocking algorithms, especially for implementing shared counters and queues, make sure that if a process blocks within a critical region, it can be brought back to the beginning of the critical region and another process is allowed to enter the region. An example of this technique, for queues, was seen earlier in these notes.] Of course, all blocking locks [like spinlocks and semaphores] can give rise to deadlocks.

In general, spinlocks with their busy-waiting mechanism are desirable (as compared to non-busy-waiting mechanisms like semaphores) when the scheduling and context-switching times are greater than the time required to execute the critical region, or when there are processors not required for other tasks.

The efficient implementation of spinlocks is a great concern because spinlocks are the usual foundation for synchronization, especially in multiprocessor systems. They are used everywhere and very frequently. As the use of multiprocessor systems increases, so will the interest in efficient spinlocks and in alternative synchronization mechanisms.

ingargiola.cis.temple.edu