[PATCH] sched: account system time properly

From: Eric Dumazet
Date: Wed Apr 29 2009 - 03:51:21 EST


Eric Dumazet a écrit :
> Andrew Morton a écrit :
>> (cc's added)
>>
>> On Thu, 23 Apr 2009 10:19:38 -0400
>> Andrew Gallatin <gallatin@xxxxxxxx> wrote:
>>
>>> When running netperf for some 10GbE tests, I noticed
>>> that IRQ and SOFTIRQ CPU time is no longer reported for an
>>> otherwise idle CPU on recent kernels, at least for x86_64.
>>>
>>> If I take a 2-CPU system, and bind the NIC IRQ to CPU0, and
>>> bind the user-space netserver daemon to CPU1, the problem
>>> is obvious when blasting 10Gb/s of traffic at it. I see
>>> no CPU used for irq or softirq on CPU0, even though it is
>>> handling 13K interrupts/sec:
>>>
>>>
>>>
>>> % mpstat -P 0 1
>>> Linux 2.6.30-rc1 (venice) 04/22/09
>>>
>>> 11:25:25 CPU %user %nice %system %iowait %irq %soft %idle
>>> intr/s
>>> 11:25:26 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>> 13248.00
>>> 11:25:27 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>>> 13280.00
>>>
>>> Common sense tells me that is wrong, and oprofile verifies there is
>>> a lot happening on CPU0. Further, when I run a cpu-soaker in
>>> usermode bound to CPU0, I start to see irq, softirq, etc,
>>> being correctly identified:
>>>
>>> 11:28:02 CPU %user %nice %system %iowait %irq %soft %idle
>>> intr/s
>>> 11:28:03 0 45.10 0.00 0.00 0.00 1.96 52.94 0.00
>>> 13019.61
>>> 11:28:04 0 46.46 0.00 0.00 0.00 2.02 51.52 0.00
>>> 13414.14
>>>
>>> The problem is observable, but much less obvious when using a more
>>> common, e1000 1GbE NIC (15% softirq is missing, rather than 50%).
>>>
>>> I spent a few hours git-bisecting until I finally got here:
>>>
>>> % git-bisect bad
>>> Bisecting: 0 revisions left to test after this
>>> [457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4] fix scaled & unscaled cputime
>>> accounting
>>>
>>> I have neither CONFIG_NO_HZ, CONFIG_VIRT_CPU_ACCOUNTING, or XEN configured.
>>>
>> Thanks for doing the bisection.
>>
>> 457533a7d3402d1d91fbc125c8bd1bd16dcd3cd4 was merged late last year, so
>> this regression has been around for a while.
>>
>> We might have fixed it in more recent kernels - stranger things have
>> happened ;)
>>
>
> Same problem here on 32 bit current git kernel, 8 cpus machine.
> Easily reproductible with a ping flood, NIC interrupts bonded to CPU0
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>
> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 396260 474568 2881560 0 0 0 4 43642 2859 0 0 100 0
> 0 0 0 396260 474568 2881560 0 0 0 0 55802 2840 0 0 100 0
> 0 0 0 396260 474576 2881560 0 0 0 4 51239 2861 0 0 100 0
> 0 0 0 396260 474576 2881560 0 0 0 0 47848 2847 0 0 100 0
> 0 0 0 396260 474576 2881560 0 0 0 4 54245 2841 0 0 100 0
> 0 0 0 396260 474576 2881560 0 0 0 0 46500 2851 0 0 100 0
> 0 0 0 396260 474576 2881560 0 0 0 4 48035 2839 0 0 100 0
> 0 0 0 396136 474580 2881560 0 0 0 0 41994 2855 0 0 100 0
> 0 0 0 396136 474580 2881560 0 0 0 4 46701 2842 0 0 100 0
>
> If load is large enough to trigger ksoftirqd, then softirq time is correctly reported.
>
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi,100.0%si, 0.0%st
>
> CONFIG_GENERIC_TIME=y
> # CONFIG_GENERIC_TIME_VSYSCALL is not set
> CONFIG_KTIME_SCALAR=y
> CONFIG_TIMERFD=y
> # CONFIG_NO_HZ is not set
> CONFIG_HIGH_RES_TIMERS=y
> CONFIG_HPET_TIMER=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> CONFIG_X86_PM_TIMER=y
> # CONFIG_NETFILTER_XT_MATCH_TIME is not set
> CONFIG_SERIAL_8250_RUNTIME_UARTS=4
> # CONFIG_HW_RANDOM_TIMERIOMEM is not set
> # CONFIG_HANGCHECK_TIMER is not set
> # CONFIG_MACHZ_WDT is not set
> CONFIG_PRINTK_TIME=y
> CONFIG_TIMER_STATS=y
>
>
> ping flood received on eth0 / eth1 interfaces (PCI-MSI-edge IRQS)
>
> $ cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 0: 343 0 0 0 0 0 0 0 IO-APIC-edge timer
> 1: 1 0 1 0 0 0 0 0 IO-APIC-edge i8042
> 9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
> 12: 0 0 0 1 1 1 1 1 IO-APIC-edge i8042
> 14: 0 0 0 0 0 0 0 0 IO-APIC-edge ide0
> 15: 0 0 0 0 0 0 0 0 IO-APIC-edge ide1
> 16: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb1
> 17: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb2
> 18: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> 19: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4
> 22: 12 10 13 9 14 14 14 8 IO-APIC-fasteoi uhci_hcd:usb5
> 33: 109191 165558 162520 156506 160469 161549 162655 161780 PCI-MSI-edge cciss0
> 34: 91628623 0 0 0 0 0 0 0 PCI-MSI-edge eth1
> 36: 18517783 0 0 0 0 0 0 0 PCI-MSI-edge eth0
> NMI: 227238100 165960221 154895362 165425712 111786333 116776935 114633666 111315114 Non-maskable interrupts
> LOC: 88434578 80673549 73277767 78134420 77183920 71046805 68905716 70869751 Local timer interrupts
> SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
> RES: 16425 216454 73793 347021 125963 30655 47623 31300 Rescheduling interrupts
> CAL: 13164 13182 13184 13175 100 13169 13151 13163 Function call interrupts
> TLB: 45488 52740 50498 53806 57548 54933 58091 56249 TLB shootdowns
> TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
>
>
>
> commit 79741dd35713ff4f6fd0eafd59fa94e8a4ba922d did :
>
> void account_process_tick(struct task_struct *p, int user_tick)
> {
> cputime_t one_jiffy = jiffies_to_cputime(1);
> cputime_t one_jiffy_scaled = cputime_to_scaled(one_jiffy);
> struct rq *rq = this_rq();
>
> if (user_tick)
> account_user_time(p, one_jiffy, one_jiffy_scaled);
> else if (p != rq->idle)
> account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
> one_jiffy_scaled);
> else
> account_idle_time(one_jiffy);
> }
>
> So, if IRQs are interrupting idle task, I guess if (p != rq->idle) will be false.
>
>

Maybe following patch is needed ?

[PATCH] sched: account system time properly

When idle task is interrupted by an IRQ, time accounting considers CPU is idle,
even while it should account for hard or softirq.

Signed-off-by: Eric Dumazet <dada1@xxxxxxxxxxxxx>

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)

if (user_tick)
account_user_time(p, one_jiffy, one_jiffy_scaled);
- else if (p != rq->idle)
+ else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
one_jiffy_scaled);
else


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/