pgsql-hackers
❮
[PATCH] Fix ProcKill lock-group vs procLatch recycle race
- Jump to comment-1Vlad Lesin<vladlesin@gmail.com>Apr 27, 2026, 8:15 AM UTCHello all,
The following are the patches and demonstration material for a concurrency bug in ProcKill() where a lock-group leader and a member can exit in parallel.
Problem------------------------------------------------------------------------
If a leader detaches from the lock group under leader_lwlock but------------------------------------------------------------------------
has not yet reached DisownLatch(&MyProc->procLatch), a concurrent
last follower can still put the leader PGPROC on a free list, or
the leader and the follower can make inconsistent decisions about
who returns which PGPROC, so that a slot is linked into the free
list with procLatch still owned, or is pushed twice. A new backend
that recycles the slot can then hit:
A concrete interleaving (lock group leader vs last member)PANIC: latch already owned by PID ...
is the following(PG15 code).
=== Lock group leader ===
L1: LWLockAcquire(leader_lwlock)
L2: dlist_delete(&MyProc->lockGroupLink)
L3: dlistisempty → false (the follower still in)The list contains follower.
L4: else if (leader != MyProc) → false, do nothing
L5: LWLockRelease(leader_lwlock)
L6: SwitchBackToLocalLatch()*lwlock released* >>>WINDOW OPEN HERE<<<
L7: pgstatresetwaiteventstorage()
L8: proc = MyProc; MyProc = NULL;
L9: DisownLatch(&proc->procLatch)
L10: SpinLockAcquire(ProcStructLock)*only here owner_pid = 0*
L11: if (proc->lockGroupLeader == NULL) → false, skip
L12: SpinLockRelease(ProcStructLock)
===========================
=== What the last lock group member does in that WINDOW ===
M1: LWLockAcquire(leader_lwlock)
M2: dlist_delete(&MyProc->lockGroupLink)Can proceed as soon as L5 releases it
M3: dlistisempty → trueThe list becomes EMPTY
M4: leader->lockGroupLeader = NULL
M5: leader != MyProc → trueFlip leader's "don't free me" flag
M6: SpinLockAcquire(ProcStructLock)
M7: push leader's PGPROC onto freelist
M8: SpinLockRelease(ProcStructLock)!!! while leader's latch is still owned !!!
M9: LWLockRelease(leader_lwlock)
===========================================================
At M7, the leader's PGPROC is back on freeProcs with
procLatch.ownerpid == leaderpid (non-zero). The leader is still
sitting between L5 and L9.
=== The new backend picks it up in InitProcess() ===
B1: SpinLockAcquire(ProcStructLock)
B2: MyProc = *procgloballist
B3: *procgloballist = MyProc->links.nextpops the leader's PGPROC
B4: SpinLockRelease(ProcStructLock)
B5: ...
B6: OwnLatch(&MyProc->procLatch)
=====================================================→ owner_pid != 0 → elog(PANIC, "latch already owned by PID %d", owner_pid);
Note that all PG versions since 9.6 are affected.
The fix is to run SwitchBackToLocalLatch(),
pgstatresetwaiteventstorage() and DisownLatch() before the
lock-group block, and to decide under leader_lwlock
(pushleader / pushself) with a single freelist update section at
the end. This matches what we implemented; detailed commentary
is in the commit messages and in the patch.
Mainline parallel query usually avoids the race: the leader is not
expected to reach ProcKill with another group member still in play
the way some extension-driven workloads can be.
Testing------------------------------------------------------------------------
Technically, a regression test should be included. However, developing one is non-trivial given the current PG18 injection point implementation. Because ProcKill() partially de-initializes data necessary for injection points to function, a significant workaround would be required. Given the complexity of such a workaround, I would like to discuss whether a test is mandatory for this patch.------------------------------------------------------------------------
I have provided versions without the test (0001, 0002) as well as a test that uses an injection point workaround to reproduce the bug deterministically(0003). If a test for the bug fix is required, please note that I can only provide it for PG17+, as earlier versions do not support injection points.
Attached patches------------------------------------------------------------------------------------------------------------------------------------------------
--1) 0001-ProcKill-REL15-lockgroup-freelist-race.patch Core proc.c for REL_15_STABLE. No TAP: PG15 has no suitable injection support for a core reproducer; fix-only for that line. 2) 0002-ProcKill-PG18-core-lockgroup-freelist-race.patch Same *core* fix for PG18: proc.c only. 3) 0003-PG18-unfixed-repro-tap-injection-harness.patch *Demonstration only* Shows the PANIC; not the shape to land after the fix without replacing the test.
Best regards,
Vlad