Task scheduler wakeups
This page documents the sleep/wake handshake used by Julia's task scheduler and,
in particular, why jl_wakeup_threadpool may wake only a single worker per
multiqueue insert instead of broadcasting to every thread.
The problem
Every @spawn that lands in the multiqueue used to call jl_wakeup_thread(-1),
which loops over all threads and issues a uv_cond_signal to each sleeping
one. On an oversubscribed machine a burst of spawns therefore produces an
$O(nthreads)$ "wake storm" per insert (JuliaLang/julia#61820,
JuliaLang/julia#50425): every idle worker is woken, races for the one new task,
and all but one immediately go back to sleep.
jl_wakeup_threadpool(tpid) replaces that broadcast. A multiqueue insert wakes
at most one sleeping worker in the task's pool. A burst of $N$ inserts wakes
up to $N$ workers. There is no peer-to-peer cascade: a woken worker simply
pops its task and runs it.
Why waking one worker is enough
Correctness rests on the same store-buffering fence ([^store_buffering_1] in
src/scheduler.c)
that the broadcast path relied on. The relevant per-thread invariant is:
A worker is observed
sleepingby the waker, or it has not yet committed to its sleep transition and is therefore guaranteed to re-check its queue and find the freshly-inserted task.
The consumer's sleep transition in jl_task_get_next is, in order:
publish
sleep_check_state = sleeping,jl_fence(),re-check the queue (
check_empty); abort the sleep if work appeared,decrement
n_threads_running,park on the condition variable while
may_sleepholds.
The enqueuer fences after the insert and then inspects each candidate's
sleep_check_state. Because the consumer publishes sleeping in step 1 — before
it re-checks the queue in step 3 — the enqueuer either sees sleeping (and wakes
the worker) or the worker will itself observe the new task in step 3. Either way
the task is serviced.
Why the wake must not be skipped using n_threads_running
A tempting optimization is to skip the scan entirely when
n_threads_running >= jl_n_threads ("everything is already running, nothing is
parked"). This is unsound, for two independent reasons:
The count lags the per-thread state.
n_threads_runningis decremented in step 4, after the consumer has already re-checked its queue in step 3. In the window between steps 1 and 4 a worker has publishedsleepingand may have already read its queue as empty, yet is still counted as running. An enqueuer that consults the count therefore reads stale "all running" information and skips a wake the worker actually needs.The count is global but the wake is pool-local. Busy workers in one pool keep
n_threads_runninghigh while another pool is entirely parked. A cross-pool insert (e.g. spawning an:interactivetask from a:defaultworker) would then be dropped, stranding the task indefinitely.
Scanning sleep_check_state avoids both problems: that flag is set in step 1,
before the re-check, so a scan never misses a worker in the danger window, and
it is inspected per-pool.
TLA+ model
The directory scheduler-wakeup/
contains a small TLA+ model of this handshake that makes the argument above
machine-checkable.
SchedulerWake.tlamodels the sleep transition as the same discrete steps listed above, so the model checker explores every interleaving of that window against an enqueuer that wakes one worker by scanningsleep_check_state.MCFixed— a concrete instance (two workers sharing a pool, plus a cross-pool producer). TLC explores the full state space with no deadlock and noNoLostWakeupviolation.
To reproduce, with tla2tools.jar:
cd doc/src/devdocs/scheduler-wakeup
java -cp tla2tools.jar tlc2.TLC -config MCFixed.cfg MCFixed.tlaToggling the model back to the unsound n_threads_running short-circuit (by
making Wakeup return early when nrun >= N) makes TLC report a NoLostWakeup
violation, confirming the danger window described above is real.