Concurrency

I want to add a second word in the title, something like “Koans” or “Vignettes”, but I don’t know a word with the right connotations.

I realized recently that I have been walking around for a long time with some confusion and unknown unknowns about how concurrency works in various settings, and decided to write about it until I stopped being confused. This post therefore doesn’t have much of a “point”.

Concurrency and Parallelism

Wikipedia, as of the time of writing:

Concurrency is the ability of different parts or units of a program, algorithm, or problem to be executed out-of-order or in partial order, without affecting the outcome.

There are two broad reasons concurrency is useful. One is for performance: if you want your computer to perform as many floating point operations as possible by lunchtime, you want all CPUs/GPUs/etc. to be performing operations simultaneously. Another is that you’re in a problem domain where you simply can’t predict the order of events: you’re writing a user interface, and the user can click on any of multiple buttons in any order; or you’re writing a web server, and any number of clients can request any pages in any order. These reasons are not mutually exclusive.

(As a slight generalization of the second reason, there are times when a concurrent producer and consumer are actually easier to write or reason about, even when the order of events is in principle completely predictable. Say, you’re parsing some recursive structure that’s spread out over a bunch of files that you want to pretend are concatenated, and wish for the recursive parsing to work in one pass without having to reason about file boundaries.)

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously.

Parallelism requires independent things that can actually run calculations simultaneously: different CPUs on a computer, or different computers.

If you alternate between chunks of one task, that’s concurrency but not parallelism. If you do two unrelated things on two computers, that’s parallelism but not concurrency.

Add

In the programming language Add, which I just made up, programs consist of a sequence of integers. To execute a program, you initialize an accumulator to 0, and then add each integer to it.

This Python program executes an Add program:
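
(A minimal sketch; here an Add program is just a Python list of ints.)

    program = [1, 2, 3, 4]

    accumulator = 0
    for instruction in program:
        accumulator += instruction
    print(accumulator)  # 10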

This Python program concurrently executes two Add programs in Python threads:
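
(Again a sketch; run_add is just a helper name I picked.)

    import threading

    def run_add(name, program):
        accumulator = 0
        for instruction in program:
            accumulator += instruction
        print(name, accumulator)

    t1 = threading.Thread(target=run_add, args=("program 1", [1, 2, 3, 4]))
    t2 = threading.Thread(target=run_add, args=("program 2", [10, 20, 30, 40]))
    t1.start()
    t2.start()
    t1.join()
    t2.join()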

This Python program concurrently executes two Add programs without using any Python concurrency features (assuming they’re equally long, because I’m lazy):
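
(Sketch: just interleave the two programs by hand, one instruction from each per step.)

    program1 = [1, 2, 3, 4]
    program2 = [10, 20, 30, 40]  # assumed to be the same length as program1

    accumulator1 = 0
    accumulator2 = 0
    for instruction1, instruction2 in zip(program1, program2):
        accumulator1 += instruction1
        accumulator2 += instruction2
    print(accumulator1, accumulator2)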

Threads and Processes

The most general definitions that attempt to sharply distinguish the two are that processes and threads are both running sequences of computer instructions, but processes don’t share state and threads do. However, both words are also used in more informal or slippery ways.

  • Processes often refer to a specific kind of operating system abstraction, the things that show up when you run ps or open the Task Manager or Activity Monitor. One of the key guarantees of operating systems (in theory) is process isolation: different processes have their own registers, memory, and resources, and can’t affect each other unless they’re deliberate about it. Ways in which inter-process communication occurs include reading/writing files, opening sockets, and sending signals.

    However, some programming languages/VMs offer their own similar abstractions, which may also be called “processes” and may not be based on processes in the underlying OS. Erlang is one such example. So what “process” means depends a lot on the context.

  • Threads can be operating system or processor abstractions, but are often user-space abstractions, defined by the programming language or implementation. Threads usually share memory. What that looks like in a typical programming language is that threads can modify variables and see each other’s modifications.

    Like processes, threads’ implementations may or may not be based on OS functionality. Threads that aren’t implemented with any OS functionality, like our last Add interpreter, are sometimes called green threads.

    (If you clicked the Erlang link, you may have noticed how it says concurrency is “programs that can handle multiple threads of execution at the same time” and “each thread of execution is called a process”, demonstrating the nebulousness of the words in practice.)

Processes are “larger” than threads. You can remember this by noting that the word “process” is longer than the word “thread”. (It also has more syllables.)

Machine code

At the level of typical multithreaded machine code on a typical CPU, when two threads execute two sequences of instructions, those instructions can basically be interleaved in any order. (They may not even be linearizable/serializable, but let’s not worry about that for now.) When two threads both execute x += 1 (or rather, the machine code it compiles to), they may end up only incrementing x once overall, because x += 1 is (typically, at least) two instructions: “read x” and “write x + 1”, and they can get interleaved as follows:

  • thread A reads x
  • thread B reads x
  • thread A writes x + 1
  • thread B writes x + 1

Here’s a short C++ program for demonstrating a race. (Note to self, who rarely compiles standalone C++ programs: the flag you want is -pthread.)
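
Something like the following: two threads bump a plain int, and the final count usually comes up short.

    // race.cpp; compile with: c++ -pthread race.cpp
    #include <iostream>
    #include <thread>

    int counter = 0;

    void work() {
        for (int i = 0; i < 1000000; i++) {
            counter += 1;  // not atomic: a separate read and write
        }
    }

    int main() {
        std::thread a(work);
        std::thread b(work);
        a.join();
        b.join();
        std::cout << counter << std::endl;  // usually noticeably less than 2000000
        return 0;
    }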

Many programming language data structures are not thread-safe for performance reasons.1 This C++ program reliably segfaults or hits some other memory corruption.
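
(Something along these lines: two threads pushing onto the same std::vector with no locking.)

    // vector_race.cpp; compile with: c++ -pthread vector_race.cpp
    #include <iostream>
    #include <thread>
    #include <vector>

    std::vector<int> accumulator;

    void work() {
        for (int i = 0; i < 1000000; i++) {
            accumulator.push_back(i);  // std::vector is not thread-safe
        }
    }

    int main() {
        std::thread a(work);
        std::thread b(work);
        a.join();
        b.join();
        std::cout << accumulator.size() << std::endl;
        return 0;
    }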

This level of abstraction is where synchronization primitives shine: mutexes, semaphores, condition variables, all that jazz.

To give another example, Java behaves similarly. This doesn’t immediately follow from the C++ example: Java programs compile down to bytecode for the Java Virtual Machine (JVM) rather than machine code, and you could theoretically imagine that there’s a bytecode instruction that atomically increments a variable, or that some other quirk of how JVM threads work makes the race less likely. But JVM threads executing bytecode interleave their instructions just as arbitrarily.
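
(The Java analogue, roughly:)

    // Race.java; run with: javac Race.java && java Race
    public class Race {
        static int counter = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1000000; i++) {
                    counter += 1;  // still a separate read and write at the bytecode level
                }
            };
            Thread a = new Thread(work);
            Thread b = new Thread(work);
            a.start();
            b.start();
            a.join();
            b.join();
            System.out.println(counter);  // usually less than 2000000
        }
    }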

(Wow it has been ages since I wrote Java.)

Java collections are similarly not thread-safe by default. Trying to add to a list concurrently from multiple threads causes exceptions galore. (Though, I’d take that over memory corruption seven days of the week.)
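
(Something like:)

    // ListRace.java
    import java.util.ArrayList;
    import java.util.List;

    public class ListRace {
        static List<Integer> accumulator = new ArrayList<>();

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1000000; i++) {
                    accumulator.add(i);  // ArrayList is not thread-safe
                }
            };
            Thread a = new Thread(work);
            Thread b = new Thread(work);
            a.start();
            b.start();
            a.join();
            b.join();
            System.out.println(accumulator.size());
        }
    }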

However, synchronization primitives are a bit more first-class in Java. Various constructs with the keyword synchronized can be used to access mutexes, and the one-line fix to the above race is to initialize accumulator as Collections.synchronizedList(new ArrayList<Integer>()) instead.

In theory, Python is in the same boat: its programs compile down to bytecode, and threads executing Python bytecode instructions can have those instructions arbitrarily interleaved. In practice, you’ll likely find it’s harder to get a race to show up with the same code:
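
Rough Python translations of the counter increment and the shared list; first, the counter:

    import threading

    counter = 0

    def work():
        global counter
        for _ in range(1000000):
            counter += 1

    a = threading.Thread(target=work)
    b = threading.Thread(target=work)
    a.start()
    b.start()
    a.join()
    b.join()
    print(counter)  # most likely prints exactly 2000000 (but read on)

and second, the shared list:

    import threading

    accumulator = []

    def work():
        for i in range(1000000):
            accumulator.append(i)

    a = threading.Thread(target=work)
    b = threading.Thread(target=work)
    a.start()
    b.start()
    a.join()
    b.join()
    print(len(accumulator))  # 2000000: no exceptions, no corruption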

One important factor is that Python has the infamous Global Interpreter Lock, or GIL. Python threads can only execute Python code when they hold the GIL, so only one thread can execute Python code at a time. (The GIL may be released when performing I/O or executing computations implemented in C, or even other non-Python FFI languages, e.g. numpy computations.) And it happens that Python threads release the GIL infrequently, nowhere close to after every instruction.

To get a race to show up in the first Python snippet, it helps to prepend these lines to encourage Python threads to release the GIL more often (cf. setswitchinterval):
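
(The exact value is arbitrary, just very small.)

    import sys
    sys.setswitchinterval(1e-9)  # request thread switches as often as possible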

The GIL also prevents races from happening in the second snippet; while a thread is appending an element to a list, it needs to hold the GIL.

Concurrency Styles

Glyph’s The Concurrency Spectrum lists four styles of “concurrent programming idioms”, which I will attempt to rephrase more briefly and uniformly in terms of threads yielding to each other. That is, we have some processor or virtual machine executing one thread at a time, and it sometimes decides to switch to executing another thread, in which case we say the first thread yields to the second.

  1. Callback-based concurrency: “threads” always run to completion (you probably don’t even call them “threads”). Any asynchronous actions are performed through setting up explicit callbacks, which you can think of as defining a new “thread” to be executed later.
  2. Explicit concurrency: threads yield to each other only at points with a specific programming construct, e.g., a yield or await keyword. Furthermore, these constructs are “colored”2 such that functions using them must themselves be called with this construct. As a result, when you read a block of code without such a construct, you know it contains no yield points and will execute atomically without being interrupted by other threads.
  3. Cooperative multithreading: threads yield to each other only at points where specific concurrency primitives are invoked. This differs from explicit concurrency in that these primitives can be called from inside functions that can themselves be called normally.
  4. Preemptive multithreading: threads yield to each other anywhere, unless protected by synchronization primitives.

To expand this list to actually cover the guarantees that multithreaded machine code on modern processors gives, I might propose a fifth style:

  5. Nonlinearizable preemptive multithreading: it is not correct to think of threads yielding to each other at all, because the results of execution may not correspond to any globally-consistent order in which the various threads’ instructions were executed. For example, perhaps it’s possible for one thread to execute a = 0; b = 0; a = 1; b = 1 but for another thread to read b == 1 && a == 0, because a happened to live in a per-CPU cache but b happened to live in a shared one. Reader Kåre von Geijer writes that this isn’t how modern processors work because they follow cache coherence; upon reflecting on what I learned from 6.004, this sounds right to me. Instead, the instructions could be reordered by the hardware or compiler in ways that follow the rules.

Just for completeness, I suppose I would also want to start the list this way:

  0. No concurrency: There is only one thread.

Lower numbers are easier to reason about, but higher numbers may offer better performance (if only by being closer to the model of computation one layer of abstraction deeper) and some kinds of flexibility.

Callbacks and Event Loops

There’s a function call_when_thing_happens. You pass it a function. It calls the function when the thing happens. Simple enough, right?

Consider installing a signal handler:
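
For example, in Python, busy-waiting on a flag that a Ctrl-C handler flips:

    import signal

    interrupted = False

    def handler(signum, frame):
        global interrupted
        interrupted = True

    signal.signal(signal.SIGINT, handler)

    print("waiting for Ctrl-C...")
    while not interrupted: pass
    print("interrupted!")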

This is fine, if a bit silly. Contrast with an event listener in everyday JavaScript:
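
For instance (in a browser):

    let clicked = false;

    document.addEventListener("click", () => {
      clicked = true;
    });

    console.log("waiting for a click...");
    while (!clicked) {}
    console.log("clicked!");  // never reached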

(I switched the event kind because signal handling seems much less common in JavaScript, but for anybody spooked by me changing both that and the programming language, you get the same result if you use a Node signal handler — which I was initially surprised to see:)
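
A sketch:

    let interrupted = false;

    process.on("SIGINT", () => {
      interrupted = true;
    });

    console.log("waiting for Ctrl-C...");
    while (!interrupted) {}
    console.log("interrupted!");  // never reached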

The JavaScript event listener will never run. One way to think about this is that while not interrupted: pass contains yield points to signal handlers, but while (!clicked) {} doesn’t contain any yield points to event handlers.

How do signal handlers work in Python? (Specifically, CPython?) Note that the answer is different from how signal handlers work in the OS — there is a concrete layer of abstraction intervening: Python installs its own signal handler (which, on Linux-likes, bottoms out to the syscall signal or sigaction), then checks whether that handler has fired at various points between Python bytecode instructions (and sometimes a few other places besides); if it has, the Python signal handler is invoked. So generally, when the Python interpreter is running pure C code that’s halfway through an opcode, it’s not interruptible.

If you have numpy installed, you can try performing a large matrix multiplication and checking that you can’t interrupt it with Ctrl-C:
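
Something like this (the size is arbitrary; make it bigger if it finishes too quickly):

    import numpy as np

    a = np.random.rand(8000, 8000)
    print("multiplying; try pressing Ctrl-C...")
    b = a @ a  # stays inside C/BLAS code; the KeyboardInterrupt only arrives once this returns
    print("done:", b.sum())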

(I actually had trouble finding an uninterruptible computation — integer exponentiation is a single opcode, and there’s some documentation suggesting long regex matches might spend a long time in C, but both operations were interruptible when I tested them.)

Then, if the Python signal handler is called after you’ve installed a handler for the relevant signal, the Python code in your handler gets executed, and can perhaps modify state that the rest of your Python code can read.

How do event handlers work in JavaScript, or signal handlers in Node? Everything is built around the event loop, which looks something like:
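
In pseudocode (none of these names are a real API):

    while (true) {
      const message = queue.waitForNextMessage(); // block until some event's handler is queued
      runToCompletion(message);                   // nothing else gets to run until this returns
    }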

A call to addEventListener or process.on tells the browser or Node runtime: when this event or signal happens, append the handler to the queue, but don’t run any code right away. (Usually — the full story is complicated.) But if processing the rest of the current message takes a long time (perhaps forever), the remaining messages/handlers in the queue are simply stuck. Many “green thread” implementations work in very similar ways.

Your reaction to this might be, wow, this event loop is so limiting. Mine was at first. But consider what you’re giving up by allowing your signal handler to run anywhere. For example, you can no longer safely refactor the first snippet below into the second, since the signal handler could run between the two statements.
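
(A contrived setup with names I just made up: a SIGINT handler appends each incoming signal number to a shared list pending, and the code below, which knows pending is non-empty, takes the most recent entry off to be processed.) The first snippet:

    item = pending.pop()  # the read and the removal happen together; nothing the handler appends can get lost

and the second:

    item = pending[-1]
    # <- if the handler runs here and appends something new...
    del pending[-1]  # ...this deletes the handler's entry, which then never gets processed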

Less contrivedly, if you let Control-C raise an exception like it does by default, it messes with all kinds of common expectations, such as in try/finally constructs. You might hope that, if foo gets called below, then (barring the entire Python process terminating for some external reason) bar will definitely get called later.
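
That is, given:

    try:
        foo()
    finally:
        bar()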

But you can’t rely on that, because a KeyboardInterrupt can happen after execution enters the finally: but before bar gets called. (Example is my own mangled adaptation of one from Control-C handling in Python and Trio.)

Finally, relying on an operating system or other host and relying on an outer event loop aren’t the only ways listeners can be implemented. You can have a completely “user-space” listener system, where you add a listener and later invoke it synchronously, possibly through many levels of indirection but without involving any external concurrency system.

Sometimes you may be working with multiple concurrency systems atop each other, and may need some finesse to “cross layers”. For example, I believe I randomly had to call SwingUtilities.invokeLater() in my old Java GUI code because it used an event loop running in one thread inside a multithreaded system.

Conclusion

Each concurrency style is an abstraction, and can be implemented atop a system providing any other concurrency style. Implementing stricter systems atop looser ones is easy: just add synchronization primitives. Conversely, writing an event loop lets you implement level-1 concurrency in a single-threaded (level-0) system, language features like JavaScript’s async/await let you write what seems to be level-2 concurrent code atop Promise-based level-1 concurrency, and custom VMs implementing custom bytecodes can generally just present any abstraction they want.

It is common for a concurrent system to be built atop another one, but use none of the latter’s concurrency features and instead implement its own concurrency system “from scratch”, usually via an event loop.

To write correct concurrent code, you may not need to know anything but the API or contract provided by the topmost layer of concurrency abstraction. High-level libraries, e.g. web frameworks, might be compatible with multiple different concurrency libraries. But you may sometimes have to write code that touches different layers of concurrency abstraction, or to investigate how multiple layers are affecting code performance, in which case it’s useful to be aware of what all those layers are and how they interact with each other.


  1. Consider e.g. Latency numbers every programmer should know.

  2. As made famous by What Color is Your Function? — though many responses have been written to it, in particular disputing whether “color” is actually bad, which any web search will pull up.

    (Not to be confused with What Colour are your bits?, which I realized while writing this post is a similar-sounding useful metaphor in computer science, but is totally unrelated.)
