Re: [uml-devel] BUG: soft lockup for a user mode linux image

From: Toralf Förster
Date: Tue Oct 08 2013 - 15:57:03 EST


Well, the quick & dirty hack below at least works for the moment to
overcome the soft lockup and the hang/unresponsiveness of the 32-bit
user mode linux guest:


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f5236f8..7e9483c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1503,6 +1503,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 pause:
+		if (pause < 0)
+			break;
 		trace_balance_dirty_pages(bdi,
 					  dirty_thresh,
 					  background_thresh,



I'm not proud of it, but after staring at the source code in
mm/page-writeback.c too often and for too long, I currently don't have
a better idea.
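
If the overflow theory from the quoted thread below holds, a less abrupt
alternative to breaking out of the loop might be to compute the period
with a 64-bit intermediate and clamp it. The following is only a
userspace sketch under that assumption, not a tested kernel patch:
period_clamped() is a hypothetical helper, and the fixed-width types
merely emulate the 32-bit "unsigned long"/"long" of the guest.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): compute
 * hz * pages_dirtied / task_ratelimit without a 32-bit wraparound
 * and cap the result at max_pause.
 */
int32_t period_clamped(uint32_t hz, uint32_t pages_dirtied,
                       uint32_t task_ratelimit, int32_t max_pause)
{
    uint64_t period;

    /* The kernel code handles task_ratelimit == 0 before dividing. */
    assert(task_ratelimit != 0);
    assert(max_pause >= 0);

    /* Widen first so the product cannot wrap modulo 2^32. */
    period = (uint64_t)hz * pages_dirtied / task_ratelimit;

    /* Never hand back more than max_pause, so the value fits. */
    if (period > (uint64_t)max_pause)
        return max_pause;
    return (int32_t)period;
}
```

With an assumed HZ of 100 and max_pause of 20 jiffies, a huge
pages_dirtied simply yields 20 instead of a negative number.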

WRT debugging the culprit: neither printk nor friends worked (maybe
b/c the affected process is just hanging?) and BUG_ON didn't give me
any new clues.
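
To make the overflow Geert describes in the quoted thread below easier
to reason about outside the kernel, here is a small userspace sketch.
It assumes a 32-bit "unsigned long" and "long" (emulated with uint32_t
and int32_t); period_32bit() is a hypothetical name, and HZ = 100 in
the usage note is an assumption, not a value taken from my config.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mimics
 *     period = HZ * pages_dirtied / task_ratelimit;
 * where the operands are 32-bit unsigned and the result is stored
 * in a 32-bit signed variable.
 */
int32_t period_32bit(uint32_t hz, uint32_t pages_dirtied,
                     uint32_t task_ratelimit)
{
    uint32_t period;

    /* The kernel code handles task_ratelimit == 0 before dividing. */
    assert(task_ratelimit != 0);

    /* Unsigned arithmetic: the product wraps modulo 2^32. */
    period = hz * pages_dirtied / task_ratelimit;

    /*
     * Values above INT32_MAX convert in an implementation-defined
     * way; gcc and clang wrap them to a negative number.
     */
    return (int32_t)period;
}
```

With HZ = 100, period_32bit(100, 30000000, 1) comes out negative when
built with gcc, whereas period_32bit(100, 9, 1) is just 900. So for
pages_dirtied = 9, as in the first backtrace, this expression alone
stays small and positive for any non-zero task_ratelimit.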


On 10/06/2013 10:26 PM, Geert Uytterhoeven wrote:
> On Sun, Oct 6, 2013 at 10:08 PM, Toralf Förster <toralf.foerster@xxxxxx> wrote:
>> On 10/06/2013 08:38 PM, Geert Uytterhoeven wrote:
>>> On Sun, Oct 6, 2013 at 4:17 PM, Toralf Förster <toralf.foerster@xxxxxx> wrote:
>>>> The UML stopped here :
>>>> ...
>>>> if (unlikely(task_ratelimit == 0)) {
>>>>         period = max_pause;
>>>>         pause = max_pause;
>>>>         BUG_ON(pause < 0);
>>>>         goto pause;
>>>> }
>>>> BUG_ON(pages_dirtied < 0);
>>>> BUG_ON(task_ratelimit < 0);
>>>> period = HZ * pages_dirtied / task_ratelimit;
>>>> BUG_ON(period < 0);    <--- here
>>>
>>> So pages_dirtied becomes that big compared to task_ratelimit (both are
>>> "unsigned long"), that period (which is "long", just like "pause") overflows
>>> into a negative number.
>>>
>>> This is indeed much more likely to happen on 32-bit.
>>>
>>>> The back trace is :
>>>
>>>> #9 0x08411c64 in balance_dirty_pages (pages_dirtied=9, mapping=<optimized out>) at mm/page-writeback.c:1471
>>>
>>> But here pages_dirtied is only 9??
>
>> Well, this points to an overflow, doesn't it?
>
> Negative indicates an overflow, but pages_dirtied doesn't.
>
>> tfoerste@n22 ~/devel/linux $ nl -ba mm/page-writeback.c | grep -A 5 -B 5 1468
>> 1463                          BUG_ON(pause < 0);
>> 1464                          goto pause;
>> 1465                  }
>> 1466                  period = HZ * pages_dirtied / task_ratelimit;
>> 1467                  pause = period;
>> 1468                  BUG_ON(pause < 0 && pages_dirtied > 0 && task_ratelimit > 0);
>> 1469                  if (current->dirty_paused_when)
>> 1470                          pause -= now - current->dirty_paused_when;
>> 1471                  /*
>> 1472                   * For less than 1s think time (ext3/4 may block the dirtier
>> 1473                   * for up to 800ms from time to time on 1-HDD; so does xfs,
>>
>>
>> and the back trace is :
>>
>> #9 0x08411c6c in balance_dirty_pages (pages_dirtied=0, mapping=<optimized out>) at mm/page-writeback.c:1468
>
> Hmm, now pages_dirtied is zero according to the backtrace, but the BUG_ON()
> asserts that it is strictly positive?!?
>
> Can you please try the following instead of the BUG_ON():
>
>     if (pause < 0) {
>             printk("pages_dirtied = %lu\n", pages_dirtied);
>             printk("task_ratelimit = %lu\n", task_ratelimit);
>             printk("pause = %ld\n", pause);
>     }
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
>


--
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3