Re: Kernel-managed IRQ affinity (cont)

From: Ming Lei
Date: Thu Jan 09 2020 - 20:28:23 EST


Hello Thomas,

On Thu, Jan 09, 2020 at 09:02:20PM +0100, Thomas Gleixner wrote:
> Ming,
>
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
> > On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> >> ... this one seems to be more appealing at least to me.
> >
> > OK, please try the following patch:
> >
> >
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 6c8512d3be88..0fbcbacd1b29 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -13,6 +13,7 @@ enum hk_flags {
> >  	HK_FLAG_TICK = (1 << 4),
> >  	HK_FLAG_DOMAIN = (1 << 5),
> >  	HK_FLAG_WQ = (1 << 6),
> > +	HK_FLAG_MANAGED_IRQ = (1 << 7),
> >  };
> >
> > #ifdef CONFIG_CPU_ISOLATION
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index 1753486b440c..0a75a09cc4e8 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -20,6 +20,7 @@
> > #include <linux/sched/task.h>
> > #include <uapi/linux/sched/types.h>
> > #include <linux/task_work.h>
> > +#include <linux/sched/isolation.h>
> >
> > #include "internals.h"
> >
> > @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >  {
> >  	struct irq_desc *desc = irq_data_to_desc(data);
> >  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +	const struct cpumask *housekeeping_mask =
> > +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
> >  	int ret;
> > +	cpumask_var_t tmp_mask;
> >
> >  	if (!chip || !chip->irq_set_affinity)
> >  		return -EINVAL;
> >
> > -	ret = chip->irq_set_affinity(data, mask, force);
> > +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> > +		return -EINVAL;
>
> That's wrong. This code is called with interrupts disabled, so
> GFP_KERNEL is wrong. And NO, we won't do a GFP_ATOMIC allocation here.

OK, it looks like desc->lock is held here.

>
> > +	/*
> > +	 * Userspace can't change managed irq's affinity, make sure
> > +	 * that isolated CPU won't be selected as the effective CPU
> > +	 * if this irq's affinity includes both isolated CPU and
> > +	 * housekeeping CPU.
> > +	 *
> > +	 * This way guarantees that isolated CPU won't be interrupted
> > +	 * by IO submitted from housekeeping CPU.
> > +	 */
> > +	if (irqd_affinity_is_managed(data) &&
> > +			cpumask_intersects(mask, housekeeping_mask))
> > +		cpumask_and(tmp_mask, mask, housekeeping_mask);
>
> This is duct tape engineering with absolutely no semantics. I can't even
> figure out the intent of this 'managed_irq' parameter.

The intent is to keep the specified (isolated) CPUs from handling managed
interrupts.

For non-managed interrupts, the isolation is done from userspace, because
userspace is allowed to change a non-managed interrupt's affinity.
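
As a rough illustration of the userspace side, the sketch below writes a hex
CPU mask to an affinity file such as /proc/irq/<N>/smp_affinity. The helper
name write_irq_affinity() is made up for this example, and it assumes a
machine with at most 32 CPUs so the mask fits in a single hex word. For a
managed interrupt the kernel is expected to reject such a write, which is
why that case needs a kernel-side answer.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write cpu_mask as a hex string to the given
 * affinity file (for example "/proc/irq/42/smp_affinity").
 * Assumes at most 32 CPUs, so one hex word is enough.
 * Returns 0 on success, -1 on any failure.
 */
static int write_irq_affinity(const char *path, uint32_t cpu_mask)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%x\n", (unsigned int)cpu_mask) > 0;

	/* The kernel may report the error only when the write is flushed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, write_irq_affinity("/proc/irq/42/smp_affinity", 0x3) would ask
for IRQ 42 (an arbitrary number chosen here) to be served by CPU0 and CPU1
only.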

>
> If the intent is to keep managed device interrupts away from isolated
> cores then you really want to do that when the interrupts are spread and
> not in the middle of the affinity setter code.
>
> But first you need to define how that mask should work:
>
> 1) Exclude CPUs from managed interrupt spreading completely
>
> 2) Exclude CPUs only when the resulting spreading contains
> housekeeping CPUs
>
> 3) Whatever ...

We can do that. The big problem is that in the RT case we can't guarantee
that IO will never be submitted from an isolated CPU. blk-mq's queue mapping
relies on the affinity that was set up for the interrupts, so unknown
behavior (kernel crash, IO hang, or something else) may result if we exclude
isolated CPUs from the interrupt affinity.

That is why I try to exclude isolated CPUs from the interrupt's effective
affinity instead; it turns out that this approach is simple and doable.
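
To make the intended semantics concrete, here is a minimal userspace sketch
of the selection rule, not the kernel patch itself. The names cpu_bits and
effective_affinity() are invented for illustration, and a 64-bit word stands
in for struct cpumask. Because it works on plain values, nothing is
allocated, unlike the zalloc_cpumask_var() call in the patch above.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; a simplified stand-in for the kernel's struct cpumask. */
typedef uint64_t cpu_bits;

/*
 * Pick the mask that would be programmed into the hardware.
 *
 * For a managed interrupt whose requested affinity contains at least one
 * housekeeping CPU, restrict the result to the housekeeping CPUs. In all
 * other cases (non-managed, or only isolated CPUs requested) the requested
 * mask is used unchanged, so the interrupt still has a CPU to land on.
 */
static cpu_bits effective_affinity(cpu_bits requested, cpu_bits housekeeping,
				   int managed)
{
	assert(requested != 0);	/* an empty affinity mask is never valid */

	if (managed && (requested & housekeeping))
		return requested & housekeeping;

	return requested;
}
```

With CPU0-1 as housekeeping (0x03), a managed interrupt spread over CPU0-3
(0x0f) ends up on 0x03, while one spread only over CPU2-3 (0x0c) keeps 0x0c.
The queue mapping is left alone, so IO submitted from an isolated CPU still
has a queue and an interrupt to complete on.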


Thanks,
Ming