Re: [PATCH v2] arm64: gic: increase the number of IRQ descriptors

From: Thomas Gleixner
Date: Tue Jan 10 2023 - 17:36:25 EST


Shanker!

On Tue, Jan 10 2023 at 08:22, Shanker Donthineni wrote:
> On 1/10/23 02:20, Marc Zyngier wrote:
>> I think you should address the problem the other way around, as there
>> are lower hanging fruits:
>>
>> - turn the irq_desc_tree radix tree into a XArray
>>
>> - use the XArray mark feature to reimplement the irqs_resend bitmap

and then you go and do:

> genirq: Use IDR API for Linux-IRQ IDs allocation

But let me look at your preparation patch first:

> +static void irq_free_descs_ids(unsigned int from, unsigned int cnt)
> +{
> +        bitmap_clear(allocated_irqs, from, cnt);
> +}
> +
> +static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt)
> +{
> +        return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
> +                                          from, cnt, 0);

This is a complete misnomer simply because this does not allocate
anything. It finds an appropriately sized empty area.

The actual bitmap update happens later which you then remove in the
second patch:

> - bitmap_set(allocated_irqs, start, cnt);

thereby breaking SPARSEIRQ=n configs....

> +}
> +
> +static unsigned int irq_get_next_id(unsigned int offset)
> +{
> +        return find_next_bit(allocated_irqs, nr_irqs, offset);
> +}

That's a misnomer too. This is not about getting an arbitrary next ID
starting at @offset. This is about finding the next allocated interrupt
number starting at @offset.

Naming matters. This code is hard enough to read already. No need for
further confusion.
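
For illustration, the same two helpers with names which describe what
they actually do; the identical names reappear in the maple tree
sketch further down:

/* Find an appropriately sized free area; the caller marks it allocated. */
static int irq_find_free_area(unsigned int from, unsigned int cnt)
{
        return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
                                          from, cnt, 0);
}

/* Find the next allocated interrupt number starting at @offset. */
static unsigned int irq_find_next_irq(unsigned int offset)
{
        return find_next_bit(allocated_irqs, nr_irqs, offset);
}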

> The build time config parameter IRQ_BITMAP_BITS (NR_IRQS+8196)
> may not be sufficient for some architectures. The interrupt ID
> space is huge for the ARM-GIC architecture (~32 bits). A static
> bitmap for managing IDs is not optimal when NR_IRQS is set to a
> high value.
>
> It uses the IDR API for IRQ ID allocation/deallocation and for
> the insertion/deletion/search of the descriptors. There are no
> other references to the macro IRQ_BITMAP_BITS, hence remove it.

Changelogs should tell the WHY and not the WHAT. I can see that it uses
IDR from the patch, but there is _ZERO_ justification why IDR is the
right choice for this.

> static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt)
> {
> -        return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
> -                                          from, cnt, 0);
> +        int start, id, i;
> +
> +        do {
> +                /* Allocate starting ID */
> +                start = idr_alloc(&idr_irq_descs, NULL, from, 0, GFP_ATOMIC);

Why does this require GFP_ATOMIC? The allocation is serialized by a
mutex and is fully preemptible.

Can you find a single GFP_ATOMIC in the irqdesc code?

If you had at least read through the changelogs of that file you would
have found a series of commits which worked towards making the irqdesc
allocation use GFP_KERNEL. But sure, it's way simpler to throw GFP_ATOMIC
at the code just because...
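
For reference, a simplified sketch of the calling context (names as in
kernel/irq/irqdesc.c, details elided): the whole allocation runs under
sparse_irq_lock, a mutex, in fully preemptible context, so GFP_KERNEL
is the appropriate flag.

        mutex_lock(&sparse_irq_lock);
        start = irq_find_free_area(from, cnt);
        /* ... range and overflow checks elided ... */
        ret = alloc_descs(start, cnt, node, affinity, owner);
        mutex_unlock(&sparse_irq_lock);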

> +                if (start < 0)
> +                        return start;
> +                idr_set_cursor(&idr_irq_descs, start);
> +
> +                /* Allocate contiguous IDs */
> +                for (i = 1; i < cnt; i++) {
> +                        id = idr_alloc_cyclic(&idr_irq_descs, NULL, start + i,
> +                                              start + i, GFP_ATOMIC);
> +                        if (id < 0) {
> +                                irq_free_descs_ids(from, i);

So if there is not enough room, then you start over. *Shudder*

Just assume a halfway densely populated IDR with tons of small holes
and then try to allocate 128 MSI vectors. That'll take ages...

You can simply use a maple_tree for this.

static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
                MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN, sparse_irq_lock);

And the functions become:

static int irq_find_free_area(unsigned int from, unsigned int cnt)
{
        MA_STATE(mas, &sparse_irqs, 0, 0);

        if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
                return -ENOSPC;
        return mas.index;
}

static unsigned int irq_find_next_irq(unsigned int offset)
{
        MA_STATE(mas, &sparse_irqs, offset, nr_irqs);
        struct irq_desc *desc = mas_next(&mas, nr_irqs);

        return desc ? irq_desc_get_irq(desc) : nr_irqs;
}

static int irq_insert_desc(unsigned int irq, struct irq_desc *desc)
{
        MA_STATE(mas, &sparse_irqs, irq, irq);

        return mas_store_gfp(&mas, desc, GFP_KERNEL);
}

static void irq_remove_desc(unsigned int irq)
{
        MA_STATE(mas, &sparse_irqs, irq, irq);

        mas_erase(&mas);
}

or something like that.
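
A hedged usage sketch for the allocation path (the descs[] array and
the surrounding error handling are assumptions, not part of the
proposal):

        start = irq_find_free_area(from, cnt);
        if (start < 0)
                return start;
        /* Publish each descriptor under its interrupt number. */
        for (i = 0; i < cnt; i++)
                irq_insert_desc(start + i, descs[i]);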

Coming back to SPARSEIRQ=n.

I'm more than tempted to take this opportunity to get rid of this
distinction. There is no real reason to have the duplicated code. We can
simply get rid of the statically allocated irq descriptor arrays and
just do the preallocation in early_irq_init().
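
A minimal sketch of that idea, assuming the maple tree helpers above
and the existing alloc_desc() from kernel/irq/irqdesc.c (error
handling elided):

int __init early_irq_init(void)
{
        int i;

        /* Preallocate the NR_IRQS descriptors the static array provided. */
        for (i = 0; i < NR_IRQS; i++) {
                struct irq_desc *desc = alloc_desc(i, first_online_node, 0,
                                                   NULL, NULL);

                irq_insert_desc(i, desc);
        }
        return arch_early_irq_init();
}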

Now for the pending bits:

> ... The irq_resend change will be simple, IDR will fit perfectly.

You wish...

>          /* Set it pending and activate the softirq: */
> -        set_bit(irq, irqs_resend);
> +        id = idr_alloc(&irqs_resend, desc, 0, 0, GFP_ATOMIC);

This breaks PREEMPT_RT as this code runs under a raw spinlock with
interrupts and preemption disabled and _cannot_ do any allocations.

Again, the changelogs of the interrupt code contain enough information
to figure these things out. But sure it's simpler to throw some half-baked
stuff at the kernel and see what sticks...

Marc's suggestion to utilize XARRAY and the mark feature would trivially
avoid this because there is no allocation required in that code
path. The descriptor already exists in the XARRAY.

But that can't work either on PREEMPT_RT because for setting the mark
the xarray code needs to acquire xarray::xa_lock, which is a regular
spinlock and cannot nest inside a raw spinlock.
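
Illustrated as a hedged sketch (the xarray name and the call site are
hypothetical; the lock nesting is the point):

        raw_spin_lock_irqsave(&desc->lock, flags);  /* raw spinlock, IRQs off */
        /*
         * xa_set_mark() acquires xarray::xa_lock, a spinlock_t. On
         * PREEMPT_RT spinlock_t is a sleeping lock, so taking it under
         * a raw spinlock is invalid.
         */
        xa_set_mark(&sparse_irqs_xa, irq, XA_MARK_0);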

So this needs a completely different approach. Let's look at the
functionality of the resend code:

It's a crutch which tries to handle the inability of (legacy)
interrupt chips to reinject interrupts at the hardware level.

There is absolutely no reason to care about performance for that, but
using IDR (or anything like that) instead of the bitmap is just
hilarious.

So what else can be done? The obvious:

static DEFINE_RAW_SPINLOCK(irq_resend_lock);
static struct hlist_head irq_resend_list;

static int irq_sw_resend(struct irq_desc *desc)
{
        ....
        raw_spin_lock(&irq_resend_lock);
        hlist_add_head(&desc->resend_node, &irq_resend_list);
        raw_spin_unlock(&irq_resend_lock);
        tasklet_schedule(&resend_tasklet);
}

and the resend side:

static void resend_irqs(struct tasklet_struct *unused)
{
        struct irq_desc *desc;

        raw_spin_lock_irq(&irq_resend_lock);
        while (!hlist_empty(&irq_resend_list)) {
                desc = hlist_entry(irq_resend_list.first, struct irq_desc,
                                   resend_node);
                hlist_del_init(&desc->resend_node);
                desc->handle_irq(desc);
        }
        raw_spin_unlock_irq(&irq_resend_lock);
}

Plus the proper mechanics to handle the hlist entry when an interrupt is
torn down, which is not rocket science either.
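
A hedged sketch of that teardown side (the helper name and call site
are assumptions; resend_node is presumed initialized with
INIT_HLIST_NODE()):

static void irq_resend_clear(struct irq_desc *desc)
{
        raw_spin_lock(&irq_resend_lock);
        /* hlist_del_init() keeps the node safe for repeated removal. */
        if (!hlist_unhashed(&desc->resend_node))
                hlist_del_init(&desc->resend_node);
        raw_spin_unlock(&irq_resend_lock);
}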

Thanks,

tglx