Re: [PATCH 0/3] fix interrupt swamp in NVMe

From: John Garry
Date: Tue Aug 20 2019 - 04:59:55 EST


On 20/08/2019 09:25, Ming Lei wrote:
On Tue, Aug 20, 2019 at 2:14 PM <longli@xxxxxxxxxxxxxxxxx> wrote:

From: Long Li <longli@xxxxxxxxxxxxx>

This patch set tries to fix an interrupt swamp problem in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one NVMe hardware
queue. This can lead to a situation where several CPUs are issuing I/Os while
all the completions are returned on the single CPU to which the hardware queue
is bound. That CPU can end up swamped by interrupts and stuck in interrupt
context for an extended time while the other CPUs continue to issue I/O. This
can trigger watchdog and RCU stall timeouts and make the system unresponsive.

This patch set addresses the problem by enforcing scheduling and throttling
I/O when a CPU is starved in this way.

Long Li (3):
sched: define a function to report the number of context switches on a
CPU
sched: export idle_cpu()
nvme: complete request in work queue on CPU with flooded interrupts

drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
include/linux/sched.h | 2 ++
kernel/sched/core.c | 7 +++++
4 files changed, 66 insertions(+), 1 deletion(-)
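
Roughly, what the cover letter describes could be sketched like this (not the
actual patches; nvme_cpu_is_flooded(), complete_work and
complete_requests_inline() are made-up names for illustration, based on the
context-switch count and idle_cpu() signals the series adds):

static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	int cpu = smp_processor_id();

	/*
	 * If this CPU looks starved -- e.g. no context switches observed
	 * since the last interrupt and idle_cpu() says nothing else ran --
	 * stop completing requests in hard irq context and punt them to a
	 * workqueue so the scheduler gets a chance to run.
	 */
	if (nvme_cpu_is_flooded(cpu)) {		/* made-up helper */
		queue_work_on(cpu, nvme_wq,
			      &nvmeq->complete_work);	/* made-up member */
		return IRQ_HANDLED;
	}

	complete_requests_inline(nvmeq);	/* stand-in for the normal path */
	return IRQ_HANDLED;
}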

Another, simpler solution may be to complete the request in a threaded
interrupt handler for this case, and meanwhile allow the scheduler to run the
interrupt thread handler on the CPUs specified by the irq affinity mask, which
was discussed at the following link:

https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@xxxxxxxxxx/

Could you try the above solution and see if the lockup can be avoided?
John Garry should have a workable patch.
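
(For reference, the threaded-handler approach splits the work roughly as
below; nvme_irq_thread() and drain_completion_queue() are stand-in names, not
code from the linked thread.)

static irqreturn_t nvme_irq(int irq, void *data)
{
	/* Hard irq part stays minimal and just kicks the irq thread. */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t nvme_irq_thread(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	/*
	 * Completions run in a schedulable context here, so the CPU can
	 * still service the watchdog, RCU and other tasks even under a
	 * completion storm.
	 */
	drain_completion_queue(nvmeq);	/* stand-in for the completion path */
	return IRQ_HANDLED;
}

Both halves are registered with request_threaded_irq().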

Yeah, so we experimented with changing the interrupt handling in the SCSI driver I maintain to use a threaded IRQ handler plus the patch below, and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing where the cpu
handling the interrupt can become totally saturated handling both the hard
interrupt and the threaded handler parts, limiting throughput.

Add an IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
interrupt to decide the policy for which cpus the threaded handler may
run on.

Signed-off-by: John Garry <john.garry@xxxxxxxxxx>

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
* interrupt handler after suspending interrupts. For system
* wakeup devices users need to implement wakeup detection in
* their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ * allowed mask for the threaded handler of a threaded interrupt
+ * handler, rather than the effective hard irq affinity.
*/
#define IRQF_SHARED 0x00000080
#define IRQF_PROBE_SHARED 0x00000100
@@ -74,6 +77,7 @@
#define IRQF_NO_THREAD 0x00010000
#define IRQF_EARLY_RESUME 0x00020000
#define IRQF_COND_SUSPEND 0x00040000
+#define IRQF_IRQ_AFFINITY 0x00080000

#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
* mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
*/
if (cpumask_available(desc->irq_common_data.affinity)) {
+ struct irq_data *irq_data = &desc->irq_data;
const struct cpumask *m;

- m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+ if (action->flags & IRQF_IRQ_AFFINITY)
+ m = desc->irq_common_data.affinity;
+ else
+ m = irq_data_get_effective_affinity_mask(irq_data);
cpumask_copy(mask, m);
} else {
valid = false;
--
2.17.1
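
A driver would then opt in at request time, roughly like this (sketch only;
the handler names, 'queue' and the device name are placeholders):

	/*
	 * With IRQF_IRQ_AFFINITY the threaded handler may run on any CPU
	 * in the irq's affinity mask rather than being pinned to the
	 * effective (typically single-CPU) affinity.
	 */
	ret = request_threaded_irq(pci_irq_vector(pdev, vector),
				   my_hard_handler,	/* returns IRQ_WAKE_THREAD */
				   my_thread_fn,	/* completes the I/O */
				   IRQF_IRQ_AFFINITY,
				   "my-dev-queue", queue);
	if (ret)
		dev_err(&pdev->dev, "failed to request irq: %d\n", ret);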

As Ming mentioned in that same thread, we could even make this the policy for managed interrupts.

Cheers,
John


Thanks,
Ming Lei
