Re: [PATCH] x86_64: fix delayed signals

From: Török Edwin
Date: Sat Jul 12 2008 - 16:27:20 EST


On 2008-07-12 20:29, Linus Torvalds wrote:
> On Sat, 12 Jul 2008, Török Edwin wrote:
>
>> On my 32-bit box (slow disks, SMP, XFS filesystem) 2.6.26-rc9 behaves
>> the same as 2.6.26-rc8, I can reliably reproduce a 2-3 second latency
>> [1] between pressing ^C the first time, and the shell returning (on the
>> text console too).
>> Using ftrace available from tip/master, I see up to 3 seconds of delay
>> between kill_pgrp and detach_pid (and during that time I can press ^C
>> again, leading to 2-3 kill_pgrp calls)
>>
>
> The thing is, it's important to see what happens in between.
>
> In particular, 2-3 second latencies can be entirely _normal_ (although
> obviously very annoying) with most log-based filesystems when they decide
> they have to flush the log.

A bit off-topic, but something I noticed during the tests:
In my original test I rm-ed the files right after launching dd in
the background, yet dd still continued writing to the disk.
I can understand that if a file is opened O_RDWR you might seek back
and read what you wrote, so Linux needs to actually do the write;
but why does it insist on writing to the disk a file that was opened
O_WRONLY and has already been unlinked?
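
For reference, the test was roughly along these lines (file name and
size here are illustrative, not the exact values I used):

dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096 &
rm /mnt/test/bigfile
# dd keeps its open fd after the unlink, and the dirty pages are
# still written back to disk until dd exits.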

> A lot of filesystems are not designed for
> latency - every single filesystem test I have ever seen has always been
> either a throughput test, or a "average random-seek latency" kind of test.
>
> The exact behavior will depend on the filesystem, for example. It will
> also easily depend on things like whether you update 'atime' or not. Many
> ostensibly read-only loads end up writing some data, especially inode
> atimes, and that's when they can get caught up in having to wait for a log
> to flush (to make up for that atime thing).
>

I already have my filesystems mounted with noatime.
But yes, I am using different filesystems: the x86-64 box has reiserfs,
and the x86-32 box has XFS.

> You can try to limit the amount of dirty data in flight by tweaking
> /proc/sys/vm/dirty*ratio

I have these in my /etc/rc.local:
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio


> , but from a latency standpoint the thing that
> actually matters more is often not the amount of dirty data, but the size
> of the requests queues - because you often care about read latency, but if
> you have big requests and especially if you have a load that does lots of
> big _contiguous_ writes (like your 'dd's would do), then what can easily
> happen is that the read ends up being behind a really big write in the
> request queues.
>

I will try tweaking these (and /sys/block/sda/queue/nr_requests that
Arjan suggested).
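
Something along these lines, I assume (the value is just a starting
point, not a recommendation):

echo 64 > /sys/block/sda/queue/nr_requests
# default is 128; a shorter queue should mean a shorter worst-case
# wait for a read stuck behind large contiguous writes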

> And 2-3 second latencies by no means means that each individual IO is 2-3
> seconds long. No - it just means that you ended up having to do multiple
> reads synchronously, and since the reads depended on each other (think a
> pathname lookup - reading each directory entry -> inode -> data), you can
> easily have a single system call causing 5-10 reads (bad cases are _much_
> more, but 5-10 are perfectly normal for even well-behaved things), and now
> if each of those reads end up being behind a fairly big write...
>

I'll try blktrace tomorrow; it should tell me when I/Os are queued and
completed.
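
Probably something like this (live view; device name assumed):

blktrace -d /dev/sda -o - | blkparse -i -
# the Q (queue), D (dispatch) and C (complete) events, with timestamps,
# should show how long a read sits behind the queued writes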

>
>> On my 64-bit box (2 disks in raid-0, UP, reiserfs filesystem) 2.6.25 and
>> 2.6.26-rc9 behave the same, and most of the time (10-20 times in a row)
>> find responds to ^C instantly.
>>
>> However in _some_ cases find doesn't respond to ^C for a very long time
>> (~30 seconds), and when this happens I can't do anything else but switch
>> consoles, starting another process (latencytop -d) hangs, and so does
>> any other external command.
>>
>
> Ok, that is definitely not related to signals at all. You're simply stuck
> waiting for IO - or perhaps some fundamental filesystem semaphore which is
> held while some IO needs to be flushed.

AFAICT reiserfs still uses the BKL; could that explain why one I/O
delays another?

> That's why _unrelated_ processes
> hang: they're all waiting for a global resource.
>
> And it may be worse on your other box for any number of reasons: raid
> means, for example, that you have two different levels of queueing, and
> thus effectively your queues are longer. And while raid-0 is better for
> throughput, it's not necessarily at all better for latency. The filesystem
> also makes a difference, as does the amount of dirty data under write-back
> (do you also have more memory in your x86-64 box, for example? That makes
> the kernel do bigger writeback buffers by default)
>
>

Yes, I have more memory in the x86-64 box (2G); the x86-32 box has 1G.

>> I haven't yet tried ftrace on this box, and neither did I try Roland's
>> patch yet. I will try that now, and hopefuly come back with some numbers
>> shortly.
>>
>
> Trust me, roland's patch will make no difference what-so-ever. It's purely
> a per-thread thing, and your behaviour is clearly not per-thread.
>

Indeed.
[Roland's patch is included in tip/master, so I had actually tried it
without realizing it.]

> Signals are _always_ delayed until non-blocking system calls are done, and
> that means until the end of IO.
>
> This is also why your trace on just 'kill_pgrp' and 'detach_pid' is not
> interesting. It's _normal_ to have a delay between them. It can happen
> because the process blocks (or catches) signals, but it will also happen
> if some system call waits for disk.
>

Is there a way to trace what happens between those two functions?
Maybe I could drop the trace filter (and thus trace all functions) and
modify the kernel sources to start tracing in kill_pgrp and stop
tracing in detach_pid.
Would that provide useful info?
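
To make the question concrete, this is roughly the userspace side I
have in mind; the knob names are my assumption from the ftrace
documentation in tip/master, so the exact paths may be off, and the
start/stop part would still need a small patch around kill_pgrp() /
detach_pid():

mount -t debugfs nodev /sys/kernel/debug 2>/dev/null
cd /sys/kernel/debug/tracing
echo function > current_tracer    # no set_ftrace_filter this time: trace everything
echo 0 > tracing_enabled          # assumed knob; keep tracing off until the interesting window
# ... reproduce the ^C hang, let the patched kernel flip tracing on in
#     kill_pgrp() and off again in detach_pid(), then dump the buffer:
cat trace > /tmp/kill_to_detach.trace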


Best regards,
--Edwin
