Re: [PATCH v2] Document /proc/pid PID reuse behavior

From: Matthew Wilcox
Date: Thu Nov 08 2018 - 07:28:00 EST


On Thu, Nov 08, 2018 at 12:02:49PM +0000, David Laight wrote:
> From: Martin Steigerwald
> > Sent: 07 November 2018 17:05
> ...
> > Its not quite on-topic, but I am curious now: AFAIK PID limit is 16
> > bits. Right? Could it be raised to 32 bits? I bet it would be a major
> > change throughout different parts of the kernel.
>
> It is probably 15 bits (since -ve pid numbers are used for process groups).
>
> My guess is that userspace and the system call interface will handle 32bit
> (signed) pid numbers.
> (I don't remember 'linux emulation' being one of the emulations that
> would truncate 32bit pids when one of the BDSs went to 32bit pids.)
> The main problem will be that big numbers will mess up the alignment
> of printouts from ps and top (etc).
> This can be mitigated by only allocating 'big' numbers on systems
> that have a lot of pids.
> You also really want an O(1) allocator.

The allocator is O(log n) -- it's the IDR allocator, used in cyclic mode.
n in this case is the highest ID which is still in use. The tree is
log_64(n) levels high. It walks to the bottom of the tree and puts a
pointer into the tree. If the cursor has wrapped to the beginning of
the tree, it may encounter a PID which is still in use; if it does,
it does a bitmap scan of that node, and will then walk up the tree,
doing a bitmap scan forward at each level until it finds a free PID.

So it's not exactly O(log(n)), but it's close enough for all practical
purposes. And more importantly, it doesn't touch a lot of cachelines.
Two or three at each level of the tree that it accesses. If we went
all the way to a 32-bit PID, the tree would grow to 6 levels deep,
and worst-case would touch 6 + 5 + 4 levels of the tree (starting with
trying to allocate PID 0xffffffff, failing, trying to allocate PID 300,
then having to walk all the way forward to find PID 0xe0000000), so
that's only 45 cachelines.

People care a little too much about O(1)/O(n) behaviour. Cacheline
behaviour, and good average-case performance without falling off a cliff
in the worst case is much more important.