Let us take a quick look at a fragment of hypothetical device memory management code. The driver is checking whether the memory it requires has been allocated yet:

if (!dptr->data[s_pos]) {
    dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
    if (!dptr->data[s_pos])
        goto out;
}

Suppose for a moment that two processes (we’ll call them “A” and “B”) are independently attempting to write to the same offset within the same device. Each process reaches the if test in the first line of the fragment above at the same time. If the pointer in question is NULL, each process will decide to allocate memory, and each will assign the resulting pointer to dptr->data[s_pos]. Since both processes are assigning to the same location, clearly only one of the assignments will prevail. If process A assigns first, its assignment will be overwritten by process B. The memory allocated by A, thus, will be dropped and never returned to the system. This sequence of events is a demonstration of a race condition. Race conditions are a result of uncontrolled access to shared data.

In a Linux system, there are numerous sources of concurrency and, therefore, possible race conditions.

  • Multiple user-space processes are running, and they can access your code in surprising combinations of ways.
  • Symmetric multiprocessing (SMP) systems can be executing code simultaneously on different processors.
  • Kernel code is preemptible; your driver’s code can lose the processor at any time, and the process that replaces it could also be running in your driver.
  • Device interrupts are asynchronous events that can cause concurrent execution of your code.
  • The kernel also provides various mechanisms for delayed code execution, such as workqueues, tasklets, and timers, which can cause your code to run at any time in ways unrelated to what the current process is doing.
  • In the hot-pluggable world, your device could simply disappear while you are in the middle of working with it.

Race Conditions

Race conditions come about as a result of shared access to resources. When two threads of execution have a reason to work with the same data structures (or hardware resources), the potential for mixups always exists. Carefully written kernel code should have a minimum of sharing. The most obvious application of this idea is to avoid the use of global variables.

The fact of the matter is, however, that such sharing is often required. Hardware resources are, by their nature, shared, and software resources also must often be available to more than one thread. Bear in mind as well that global variables are far from the only way to share data; any time your code passes a pointer to some other part of the kernel, it is potentially creating a new sharing situation.

Any time a hardware or software resource is shared beyond a single thread of execution, and the possibility exists that one thread could encounter an inconsistent view of that resource, you must explicitly manage access to that resource. In the example above, process B’s view of the situation is inconsistent: it is unaware that process A has already allocated memory for the (shared) device. The usual technique for access management is called locking or mutual exclusion. Note, too, that sharing has lifetime implications: when kernel code creates an object that will be shared with any other part of the kernel, that object must continue to exist (and function properly) until it is known that no outside references to it exist. The instant the above driver makes its devices available, it must be prepared to handle requests on those devices.

Semaphores and Mutexes

Our goal is to make our operations on the data structure (in above code snippet) atomic, meaning that the entire operation happens at once as far as other threads of execution are concerned. For our memory leak example, we need to ensure that if one thread finds that a particular chunk of memory must be allocated, it has the opportunity to perform that allocation before any other thread can make that test. To this end, we must set up critical sections: code that can be executed by only one thread at any given time.

Not all critical sections are the same, so the kernel provides different primitives for different needs. In this case, every access to the data structure happens in process context as a result of a direct user request; no accesses will be made from interrupt handlers or other asynchronous contexts. There are no particular latency (response time) requirements; application programmers understand that I/O requests are not usually satisfied immediately. Furthermore, the driver is not holding any other critical system resource while it is accessing its own data structures. What all this means is that if the driver goes to sleep while waiting for its turn to access the data structure, nobody is going to mind.

When a Linux process reaches a point where it cannot make any further progress, it goes to sleep (or “blocks”), yielding the processor to somebody else until some future time when it can get work done again. Processes often sleep when waiting for I/O to complete. In this case we can use a locking mechanism that might cause the process to sleep while waiting for access to the critical section.

Just as importantly, we will be performing an operation (memory allocation with kmalloc) that could sleep—so sleeps are a possibility in any case. If our critical sections are to work properly, we must use a locking primitive that works when a thread that owns the lock sleeps. Not all locking mechanisms can be used where sleeping is a possibility. For our present needs, however, the mechanism that fits best is a semaphore.

At its core, a semaphore is a single integer value combined with a pair of functions that are typically called P and V. A process wishing to enter a critical section will call P on the relevant semaphore; if the semaphore’s value is greater than zero, that value is decremented by one and the process continues. If instead, the semaphore’s value is 0 (or less), the process must wait until somebody else releases the semaphore. Unlocking a semaphore is accomplished by calling V; this function increments the value of the semaphore and, if necessary, wakes up processes that are waiting.

When semaphores are used for mutual exclusion—keeping multiple processes from running within a critical section simultaneously—their value will be initially set to 1. Such a semaphore can be held only by a single process or thread at any given time. A semaphore used in this mode is sometimes called a mutex, which is an abbreviation for “mutual exclusion”.

Semaphore Implementation

The Linux kernel provides an implementation of semaphores that conforms to the above semantics, although the terminology is a little different. To use semaphores, kernel code must include <asm/semaphore.h>. The relevant type is struct semaphore; actual semaphores can be declared and initialized in a few ways. One is to create a semaphore directly, then set it up with sema_init:

void sema_init(struct semaphore *sem, int val);

where val is the initial value to assign to the semaphore. Semaphores are usually used in mutex mode, however; to make this common case easier, the kernel provides a set of helper functions and macros. Thus, a mutex can be declared and initialized with one of the following:

DECLARE_MUTEX(name);
DECLARE_MUTEX_LOCKED(name);
Here, the result is a semaphore variable (called name) that is initialized to 1 (with DECLARE_MUTEX) or 0 (with DECLARE_MUTEX_LOCKED). In the latter case, the mutex starts out in a locked state; it will have to be explicitly unlocked before any thread will be allowed access.

If the mutex must be initialized at runtime (which is the case if it is allocated dynamically, for example), use one of the following:

void init_MUTEX(struct semaphore *sem);
void init_MUTEX_LOCKED(struct semaphore *sem);
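
As a minimal sketch of runtime initialization (struct sketch_dev and its field names are hypothetical, and this assumes an older 2.6-series kernel where init_MUTEX is available):

struct sketch_dev {
    void **data;             /* per-device storage, as in the fragment above */
    struct semaphore sem;    /* mutual exclusion semaphore */
};

struct sketch_dev *dev = kmalloc(sizeof(struct sketch_dev), GFP_KERNEL);
if (dev) {
    memset(dev, 0, sizeof(struct sketch_dev));
    init_MUTEX(&dev->sem);   /* value 1: the mutex starts out unlocked */
}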

In the Linux world, the P function is called down—or some variation of that name. Here, “down” refers to the fact that the function decrements the value of the semaphore and, perhaps after putting the caller to sleep for a while to wait for the semaphore to become available, grants access to the protected resources. There are three versions of down:

void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);

down decrements the value of the semaphore and waits as long as need be. down_interruptible does the same, but the operation is interruptible. The interruptible version is almost always the one you will want; it allows a user-space process that is waiting on a semaphore to be interrupted by the user. You do not, as a general rule, want to use noninterruptible operations unless there truly is no alternative; noninterruptible operations are a good way to create unkillable processes (the dreaded “D state” seen in ps). Using down_interruptible requires some extra care, however: if the operation is interrupted, the function returns a nonzero value, and the caller does not hold the semaphore. Proper use of down_interruptible requires always checking the return value and responding accordingly. The final version, down_trylock, never sleeps; if the semaphore is not available at the time of the call, down_trylock returns immediately with a nonzero return value.
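
For instance, the usual pattern looks like this minimal sketch (dev is a hypothetical device structure with an embedded struct semaphore named sem); returning -ERESTARTSYS lets the higher layers of the kernel either restart the system call or return -EINTR to the user:

if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;   /* interrupted; we do not hold the semaphore */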

Once a thread has successfully called one of the versions of down, it is said to be “holding” the semaphore. That thread is now entitled to access the critical section protected by the semaphore. When the operations requiring mutual exclusion are complete, the semaphore must be returned. The Linux equivalent to V is up:

void up(struct semaphore *sem);

Once up has been called, the caller no longer holds the semaphore.

As you would expect, any thread that takes out a semaphore is required to release it with one (and only one) call to up. Special care is often required in error paths; if an error is encountered while a semaphore is held, that semaphore must be released before returning the error status to the caller. Failure to free a semaphore is an easy error to make; the result (processes hanging in seemingly unrelated places) can be hard to reproduce and track down.
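
To illustrate, here is a hedged sketch combining the allocation fragment from the beginning of this section with a semaphore (the function, structure, and field names are hypothetical):

ssize_t sketch_write(struct sketch_dev *dev, int s_pos, int quantum)
{
    ssize_t retval = 0;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;           /* never acquired the semaphore */

    if (!dev->data[s_pos]) {
        dev->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dev->data[s_pos]) {
            retval = -ENOMEM;
            goto out;                  /* error path: must still release */
        }
    }
    /* ... perform the actual write ... */

out:
    up(&dev->sem);                     /* one (and only one) up per down */
    return retval;
}

Funneling both the success and error paths through the single out: label makes it much harder to forget the release.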

Spinlock Implementation

Unlike semaphores, spinlocks may be used in code that cannot sleep, such as interrupt handlers. When properly used, spinlocks generally offer higher performance than semaphores. They do, however, impose a different set of constraints on their use. Spinlocks are simple in concept. A spinlock is a mutual exclusion device that can have only two values: “locked” and “unlocked.” It is usually implemented as a single bit in an integer value. Code wishing to take out a particular lock tests the relevant bit. If the lock is available, the “locked” bit is set and the code continues into the critical section. If, instead, the lock has been taken by somebody else, the code goes into a tight loop where it repeatedly checks the lock until it becomes available. This loop is the “spin” part of a spinlock.

The “test and set” operation must be done in an atomic manner so that only one thread can obtain the lock, even if several are spinning at any given time. The core concept is the same on all systems. However, when there is contention for a spinlock, the processors that are waiting execute a tight loop and accomplish no useful work. Spinlocks are, by their nature, intended for use on multiprocessor systems. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status.

The required include file for the spinlock primitives is <linux/spinlock.h>. An actual lock has the type spinlock_t. Like any other data structure, a spinlock must be initialized. This initialization may be done at compile time as follows:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

or at runtime with:

void spin_lock_init(spinlock_t *lock);

Before entering a critical section, your code must obtain the requisite lock with

void spin_lock(spinlock_t *lock);

Note that all spinlock waits are, by their nature, uninterruptible. Once you call spin_lock, you will spin until the lock becomes available. To release a lock that you have obtained, pass it to:

void spin_unlock(spinlock_t *lock);
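
Putting these together, here is a minimal sketch (my_lock and shared_count are hypothetical names, not from any real driver):

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static unsigned long shared_count;   /* protected by my_lock */

void count_event(void)
{
    spin_lock(&my_lock);
    shared_count++;                  /* critical section: short, and must never sleep */
    spin_unlock(&my_lock);
}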

Spinlocks and Atomic Context

Imagine for a moment that your driver acquires a spinlock and goes about its business within its critical section. Somewhere in the middle, your driver loses the processor. Perhaps it has called a function (copy_from_user, say) that puts the process to sleep. Or, perhaps, kernel preemption kicks in, and a higher-priority process pushes your code aside. Your code is now holding a lock that it will not release any time in the foreseeable future. If some other thread tries to obtain the same lock, it will, in the best case, wait (spinning in place) for a very long time. In the worst case, the system could deadlock entirely.

Therefore, the core rule that applies to spinlocks is that any code must, while holding a spinlock, be atomic. It cannot sleep; in fact, it cannot relinquish the processor for any reason except to service interrupts (and sometimes not even then). The kernel preemption case is handled by the spinlock code itself. Any time kernel code holds a spinlock, preemption is disabled on the relevant processor. The last important rule for spinlock usage is that spinlocks must always be held for the minimum time possible. The longer you hold a lock, the longer another processor may have to spin waiting for you to release it, and the chance of it having to spin at all is greater.

The Spinlock Functions

We have already seen two functions, spin_lock and spin_unlock, that manipulate spinlocks. There are several other functions, however, with similar names and purposes. We will now present the full set. There are actually four functions that can lock a spinlock:

void spin_lock(spinlock_t *lock);
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_lock_irq(spinlock_t *lock);
void spin_lock_bh(spinlock_t *lock);

We have already seen how spin_lock works. spin_lock_irqsave disables interrupts (on the local processor only) before taking the spinlock; the previous interrupt state is stored in flags. If you are absolutely sure nothing else might have already disabled interrupts on your processor (or, in other words, you are sure that you should enable interrupts when you release your spinlock), you can use spin_lock_irq instead and not have to keep track of the flags. Finally, spin_lock_bh disables software interrupts before taking the lock, but leaves hardware interrupts enabled.

There are also four ways to release a spinlock; the one you use must correspond to the function you used to take the lock:

void spin_unlock(spinlock_t *lock);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);

Each spin_unlock variant undoes the work performed by the corresponding spin_lock function. The flags argument passed to spin_unlock_irqrestore must be the same variable passed to spin_lock_irqsave. You must also call spin_lock_irqsave and spin_unlock_irqrestore in the same function; otherwise, your code may break on some architectures.
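
For example, code that shares data with an interrupt handler might do the following (reusing the hypothetical my_lock and shared_count from above; flags is a local variable in the same function, as required):

unsigned long flags;

spin_lock_irqsave(&my_lock, flags);        /* lock held, local interrupts off */
shared_count++;                            /* safe against our own interrupt handler */
spin_unlock_irqrestore(&my_lock, flags);   /* restores the saved interrupt state */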

There is also a set of nonblocking spinlock operations:

int spin_trylock(spinlock_t *lock);
int spin_trylock_bh(spinlock_t *lock);

These functions return nonzero on success (the lock was obtained), 0 otherwise. There is no “try” version that disables interrupts.
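
A sketch of the nonblocking style, again with the hypothetical names from above:

if (spin_trylock(&my_lock)) {
    shared_count++;            /* we hold the lock */
    spin_unlock(&my_lock);
} else {
    /* lock is busy; do something useful instead of spinning */
}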