Re: high system cpu load during intense disk i/o

From: Dimitrios Apostolou
Date: Thu Aug 09 2007 - 04:21:45 EST


Hi Rafał, thank you for your help!

On Wednesday 08 August 2007 22:08:18 Rafał Bilski wrote:
> > Hello again,
>
> Hi,
>
> > I'm now using libata on the same system described before (see the attached
> > dmesg.txt). When writing to both disks I think the problem is now worse
> > (pata_oprof_bad.txt, pata_vmstat_bad.txt); even the oprofile script
> > needed half an hour to complete! For completeness I also attach the same
> > tests when writing to only one disk (pata_vmstat_1disk.txt,
> > pata_oprof_1disk.txt), where everything is normal.
> >
> > FWIW, libata did not give me any performance benefit; 20 MB/s is again the
> > peak that hdparm reports.
>
> This OProfile output is not very useful in this case. It says that Your
> system is spending 25% of its CPU time in no-op loops, but it doesn't say
> why. Your system really isn't very busy. Look here:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>  2  2      0 225352   5604  91700    0    0 18112  1664 28145  6315 29 71  0  0
>  2  2      0 225360   5604  91700    0    0 18496  1664 27992  6358 30 70  0  0
>  1  2      0 225360   5604  91700    0    0 18432  1472 28511  6315 28 72  0  0
>  1  2      0 225360   5604  91700    0    0 18240  1536 28031  6153 31 69  0  0
> That is with a 720x576 25fps YUV video stream over PCI on top, and the system
> is fully responsive. Of course programs which need disk access must wait a
> bit longer, but afterwards they run fine.
> I don't have disks as fast as Yours and I can't do a destructive write
> test. First disk:
>  1  1      0 241848   7312 100768    0    0 27712     0   927  1270 29 13  0 58
>  1  1      0 241052   7580 100896    0    0  4612  4676   519   702 34 12  0 54
> Second disk:
>  0  1      0 237752   7268 100980    0    0  6464     0   468   583 37 10  0 53
>  0  1      0 241060   7532 100884    0    0  1728  1728   465   578 31  9  0 60
> Both:
>  0  2      0 241592   7384 100776    0    0 33024     0   905  1415 33 16  0 51
>  1  2      0 240804   7528 100884    0    0  6848  6848   642   780 38 10  0 52
> So sda + sdc = both.
>
> Your single disk:
>  0  1      0 128804  19620  82484    0    0     0 21120   335   675  0  4  0 96
> Both:
>  5  2      0 168480  10972  47152    0    0     0 16000   252   470 22 78  0  0
> I would expect 2*21k, but we get only 2*8k, which is even lower than the
> single-disk case. Of course this math isn't moving us forward. The only thing
> which would help would be a function call trace, as Andrew wrote. Which
> function is calling delay_tsc()? Is it calling it often, or once but with a
> long delay?

Please guide me on obtaining these call traces. Unfortunately I have no
experience at all with kernel-space debugging. What tools should I use?
Keep in mind also that I mostly have only remote access to this PC.
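
From the oprofile documentation I gather that its call-graph feature might be
what is needed here; is something along these lines the right direction? (This
is untested on my side; the vmlinux path and the callgraph depth below are
just guesses.)

    opcontrol --deinit
    opcontrol --setup --vmlinux=/usr/src/linux/vmlinux --callgraph=16
    opcontrol --start
    # ... start the writes to both disks and let them run for a while ...
    opcontrol --stop
    opreport --callgraph > callgraph.txt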

However, I am not really confident that the cause is delay_tsc(); it only
appeared in the libata tests. To summarize, here are the top entries of all
the problematic OProfile runs I have done so far:

first test:
3832 23.4818 __switch_to
3380 20.7121 schedule
653 4.0015 mask_and_ack_8259A
452 2.7698 __blockdev_direct_IO
327 2.0038 dequeue_task

second test, with idle=poll (which didn't really help, as it seems the CPU is
not idling at all anyway):
2477 31.2871 __switch_to
1625 20.5255 schedule
246 3.1072 mask_and_ack_8259A
222 2.8041 __blockdev_direct_IO
147 1.8568 follow_page

third test, libata and idle=poll (the problem felt worse here; the profiling
itself took a really long time):
9556 26.0523 delay_tsc
6424 17.5136 iowrite8
6178 16.8430 __switch_to
4015 10.9460 schedule
3335 9.0921 ioread8
1429 3.8959 iowrite32


So I would assume that delay_tsc() probably only makes things worse in the
libata tests, but that the real problem lies in __switch_to() and schedule().
Do you agree with these assumptions? What are these functions used for? My
guess would be that __switch_to() performs the actual context switch and
schedule() picks the next process to run. However, the context switch rate
reported by vmstat does not seem high enough to account for the time
apparently lost in __switch_to(); in the single-disk measurements the context
switch rate is actually higher, yet the time spent there is much lower.
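
To put a rough number on it (the per-switch cost below is only a guess on my
part, meant as a sanity check rather than a measurement): vmstat shows on the
order of 500 context switches per second during the bad run, so even a
generously expensive switch would account for a tiny fraction of one CPU:

    ~500 switches/s * ~5 us/switch = ~2.5 ms of CPU per second, i.e. ~0.25%

which is nowhere near the 30-45% of samples that oprofile attributes to
__switch_to() plus schedule() in the runs above.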

Is there a way for oprofile to report the time spent in functions broken down
by call trace?

> So far it looks to me like some kind of hardware limit. Do You have any
> BIOS options which could degrade PCI or disk performance?

None that I can think of. Do you have any specific options in mind?


Thanks again for the help. I guess if this doesn't lead anywhere I'll just
start compiling older vanilla kernels and see when the problem disappears.
But that takes a lot of time, and I'm not sure how long I can afford to keep
those disks out of service (I was supposed to use them in a RAID 0 array to
provide storage space, but with the current problems that wouldn't be wise).
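
If it comes to that, I suppose something like git bisect would at least keep
the number of kernels I have to build manageable (the version numbers below
are only placeholders for the last known-good and first known-bad releases):

    git bisect start
    git bisect bad v2.6.22    # first kernel where I see the problem (placeholder)
    git bisect good v2.6.18   # last kernel that behaved well (placeholder)
    # build and boot the kernel git checks out, repeat the dd test on both
    # disks, then run "git bisect good" or "git bisect bad" accordingly,
    # until git points at the first bad commit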

Dimitris

>
> > Thanks,
> > Dimitris
>
> Rafał