Re: [Bugme-new] [Bug 15709] New: swapper page allocation failure

From: Robert Wimmer
Date: Thu May 13 2010 - 17:09:27 EST


Finally I've had some time to do the next test.
Here is a wireshark dump (~750 MByte):
http://213.252.12.93/2.6.34-rc5.cap.gz

dmesg output after page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26371

stack trace before page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26369

stack trace after page allocation failure:
https://bugzilla.kernel.org/attachment.cgi?id=26370
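
A snapshot like the two above can be taken from the stack tracer
pseudofile that Trond describes in the quoted message below. This is
only a minimal sketch, assuming debugfs is already mounted at
/sys/kernel/debug and the stack tracer is enabled. The function name
save_stack_trace and the output file names are hypothetical.

```shell
# Save a snapshot of the ftrace stack tracer output to a file.
# Assumes debugfs is mounted (mount -t debugfs none /sys/kernel/debug)
# and the stack tracer is enabled (CONFIG_STACK_TRACER plus
# "echo 1 > /proc/sys/kernel/stack_tracer_enabled").
# save_stack_trace and its arguments are hypothetical names for this sketch.
save_stack_trace() {
    out_file="$1"
    trace_file="${2:-/sys/kernel/debug/tracing/stack_trace}"
    if [ ! -r "$trace_file" ]; then
        echo "cannot read $trace_file (debugfs mounted? running as root?)" >&2
        return 1
    fi
    cat "$trace_file" > "$out_file"
}
```

It could be called once right after mounting, e.g.
"save_stack_trace stack-before.txt", and once more after the failure
shows up in dmesg, e.g. "save_stack_trace stack-after.txt".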

I hope the wireshark dump is not too big to download.
It was created with:
tshark -f "tcp port 2049" -i eth0 -w 2.6.34-rc5.cap
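
If the size is a problem, a ring-buffer capture might be an option for
the next run. The following is only a sketch, assuming a tshark build
that supports the -b filesize:/files: options. The function name
capture_nfs, the TSHARK override variable and the chosen limits
(10 files of about 100 MB each) are assumptions, not what was used for
the dump above.

```shell
# Capture NFS traffic (TCP port 2049) into a ring of size-limited files.
# capture_nfs and the TSHARK override are hypothetical helpers; the limits
# below (10 files, 102400 kB each) are example values only.
capture_nfs() {
    iface="${1:-eth0}"
    out="${2:-2.6.34-rc5.cap}"
    # -b filesize:N switches to a new file after N kB,
    # -b files:N keeps at most N files and overwrites the oldest.
    "${TSHARK:-tshark}" -f "tcp port 2049" -i "$iface" \
        -b filesize:102400 -b files:10 -w "$out"
}
```

Running "capture_nfs eth0 2.6.34-rc5.cap" would then keep only the most
recent traffic, which should still cover the time around the failure if
the capture is stopped soon afterwards.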

Thanks!
Robert



On 05/06/10 23:30, Trond Myklebust wrote:
> Sorry. I've been caught up in work in the past few days.
>
> I can certainly help with the soft lockup if you are able to supply
> either a dump that includes all threads stuck in the NFS, or a (binary)
> wireshark dump that shows the NFSv4 traffic between the client and
> server around the time of the hang.
>
> Cheers
> Trond
>
> On Thu, 2010-05-06 at 23:19 +0200, Robert Wimmer wrote:
>
>> I don't know if anyone is still interested in this,
>> but I think Trond is no longer interested because
>> the last error was of course a "page allocation
>> failure" and not the "soft lockup" which Trond was
>> trying to solve. But the patch was for 2.6.34 and
>> the "soft lockup" comes up only with some 2.6.30 and
>> maybe some 2.6.31 kernel versions. But the first error
>> I reported was a "page allocation failure", which
>> all kernels >= 2.6.32 produce with the configuration
>> I use (NFSv4).
>>
>> Michael suggested solving the "soft lockup" first,
>> before further investigating the "page allocation
>> failure". We know that the "soft lockup" only
>> pops up with NFSv4 and not v3. I really want to
>> use v4, but since I'm not a kernel hacker someone
>> must guide me on what to try next.
>>
>> I know that you all have a lot of other work to
>> do, but if there are no ideas left about what to do next
>> it's maybe best to close the bug for now. I'll stay with
>> kernel 2.6.30 for now, or go back to NFS v3 if I
>> upgrade to a newer kernel. Maybe the error will
>> be fixed "by accident" in >= 2.6.35 ;-)
>>
>> Thanks!
>> Robert
>>
>>
>>
>> On 05/03/10 10:11, kernel@xxxxxxxxxxx wrote:
>>
>>> Anything we can do to investigate this further?
>>>
>>> Thanks!
>>> Robert
>>>
>>>
>>> On Wed, 28 Apr 2010 00:56:01 +0200, Robert Wimmer <kernel@xxxxxxxxxxx>
>>> wrote:
>>>
>>>
>>>> I've applied the patch against the kernel which I got
>>>> from "git clone ....", which resulted in a 2.6.34-rc5 kernel.
>>>>
>>>> The stack trace after mounting NFS is here:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26166
>>>> /var/log/messages after soft lockup:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=26167
>>>>
>>>> I hope that there is some useful information in there.
>>>>
>>>> Thanks!
>>>> Robert
>>>>
>>>> On 04/27/10 01:28, Trond Myklebust wrote:
>>>>
>>>>
>>>>> On Tue, 2010-04-27 at 00:18 +0200, Robert Wimmer wrote:
>>>>>
>>>>>
>>>>>
>>>>>>> Sure. In addition to what you did above, please do
>>>>>>>
>>>>>>> mount -t debugfs none /sys/kernel/debug
>>>>>>>
>>>>>>> and then cat the contents of the pseudofile at
>>>>>>>
>>>>>>> /sys/kernel/debug/tracing/stack_trace
>>>>>>>
>>>>>>> Please do this more or less immediately after you've finished mounting
>>>>>>> the NFSv4 client.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I've uploaded the stack trace. It was generated
>>>>>> directly after mounting. Here are the stacks:
>>>>>>
>>>>>> After mounting:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26153
>>>>>> After the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26154
>>>>>> The dmesg output of the soft lockup:
>>>>>> https://bugzilla.kernel.org/attachment.cgi?id=26155
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Does your server have the 'crossmnt' or 'nohide' flags set, or does it
>>>>>>> use the 'refer' export option anywhere? If so, then we might have to
>>>>>>> test further, since those may trigger the NFSv4 submount feature.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> The server has the following settings:
>>>>>> rw,nohide,insecure,async,no_subtree_check,no_root_squash
>>>>>>
>>>>>> Thanks!
>>>>>> Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> That second trace is more than 5.5K deep, more than half of which is
>>>>> socket overhead :-(((.
>>>>>
>>>>> The process stack does not appear to have overflowed, however that trace
>>>>> doesn't include any IRQ stack overhead.
>>>>>
>>>>> OK... So what happens if we get rid of half of that trace by forcing
>>>>> asynchronous tasks such as this to run entirely in rpciod instead of
>>>>> first trying to run in the process context?
>>>>>
>>>>> See the attachment...
>>>>>
>>>>>
>>>>>
>>
>
>
