Re: Process Hang in __read_seqcount_begin

From: Peter LaDow
Date: Tue Oct 30 2012 - 19:09:16 EST


Ok. More of an update. We've managed to create a scenario that
exhibits the problem much earlier. We can now cause the lockup to
occur within a few hours (rather than the 12 to 24 hours in our other
scenario).

Our setup is to have a lot of traffic constantly being processed
by the netfilter code. After about 2 hours, any external attempt to
read the table entries (such as with getsockopt and
IPT_SO_GET_ENTRIES) triggers the lockup. What is strange is that this
does not appear until after a couple of hours of heavy traffic. We
cannot trigger this problem in the first hour, rarely in the second
hour, and always after the second hour.

Now, our original setup did not have as much traffic. But based upon
a quick, back of the napkin computation, it seems to be occurring
after a certain amount of traffic. I can try to get more firm numbers.
But this kind of behavior hints less at a race condition between two
writers, and more at something that depends upon the amount of
traffic. Indeed, my test program only uses IPT_SO_GET_ENTRIES, which
does not trigger the second path to do_add_counters(). So I'm no
longer thinking the path through setsockopt is a cause of the problem.

So instead, it seems that the only way there could be multiple writers
(assuming that is the problem) is if there are multiple contexts
through which ipt_do_table() is called. So far, my perusal of the
code indicates it is called only through the hooks in each of the
iptables modules, and it isn't clear to me how these hooks are
invoked. But it does seem that even with the patch Eric provided
(which fixes the seqcount update), there is still a potential problem.
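
To make the multiple-writer concern concrete, here is a small
userspace sketch of how two unserialized writers could leave a
sequence counter odd, which would make a reader spin forever waiting
for an even value. This is only a model: the struct and function
names are hypothetical, the non-atomic increment is split into an
explicit load and store, and the interleaving is replayed by hand
rather than with real threads. Whether this is what actually happens
in our case is still an assumption.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a sequence counter; not the kernel type. */
struct model_seqcount {
    unsigned int sequence;
};

/* One writer's private view of a non-atomic "sequence++": load, then store. */
struct model_writer {
    unsigned int loaded;
};

static void writer_load(struct model_writer *w, const struct model_seqcount *s)
{
    w->loaded = s->sequence;
}

static void writer_store(const struct model_writer *w, struct model_seqcount *s)
{
    s->sequence = w->loaded + 1;
}

/* A reader may only proceed when the sequence is even (no write in progress). */
static bool reader_can_proceed(const struct model_seqcount *s)
{
    return (s->sequence & 1u) == 0;
}

/* Replays one bad interleaving of two unserialized writers. */
static unsigned int replay_two_writer_race(void)
{
    struct model_seqcount s = { 0 };
    struct model_writer a, b;

    /* "begin": both writers load 0 before either one stores. */
    writer_load(&a, &s);
    writer_load(&b, &s);
    writer_store(&a, &s);   /* sequence = 1 */
    writer_store(&b, &s);   /* sequence = 1 again, one increment is lost */

    /* "end": the two closing increments happen back to back. */
    writer_load(&a, &s);
    writer_store(&a, &s);   /* sequence = 2 */
    writer_load(&b, &s);
    writer_store(&b, &s);   /* sequence = 3, odd with no writer active */

    return s.sequence;
}

/* The same four increments, fully serialized, leave the sequence even. */
static unsigned int replay_serialized_writers(void)
{
    struct model_seqcount s = { 0 };
    struct model_writer w;
    int i;

    for (i = 0; i < 4; i++) {
        writer_load(&w, &s);
        writer_store(&w, &s);
    }
    return s.sequence;
}
```

In the racy replay the counter stays odd after both writers have
finished, so reader_can_proceed() never becomes true. In the
serialized replay it ends even and a reader would get through.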

If indeed we have multiple contexts executing ipt_do_table(), it is
possible for more than just the seqcount to be corrupted. Indeed, it
seems that any updates to the internal structures could cause
problems. It isn't clear to me if there is anything modified here,
other than the counters, so I'm not sure if there are any other
issues. But regardless, if the counters could become corrupted, then
it is possible to break any rules that use them.

Anyway, based on earlier discussion, is there any reason not to use a
lock (presuming any solution properly takes into account possible
recursion)? I understand that the mainline is protected, but perhaps
in the RT version we can use seqlock (and prevent a recursive lock)?
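
As a rough illustration of what "taking recursion into account" could
look like, here is a userspace sketch in which only the outermost
write section bumps the sequence, so a nested call from the same
context does not flip it back to even in the middle of an update. All
names here are hypothetical and this is only a model of the idea, not
a proposed patch. It also says nothing about two different contexts
preempting each other, which would still need a real lock or some
other form of exclusion.

```c
/* Hypothetical model: a sequence that only the outermost writer bumps. */
struct model_recseq {
    unsigned int sequence;
};

/*
 * Opens a write section. Returns 1 if this call made the sequence odd
 * (outermost writer), or 0 if it was already odd (nested call).
 */
static unsigned int model_write_begin(struct model_recseq *s)
{
    unsigned int addend = (s->sequence + 1) & 1u; /* 1 if even, 0 if odd */

    s->sequence += addend;
    return addend;
}

/* Closes a write section; only the outermost caller passes addend == 1. */
static void model_write_end(struct model_recseq *s, unsigned int addend)
{
    s->sequence += addend;
}

/*
 * Replays an outer write section with one nested section inside it.
 * Stores the sequence seen inside the nested section in *seq_inside
 * and returns the final sequence.
 */
static unsigned int replay_nested_write(unsigned int *seq_inside)
{
    struct model_recseq s = { 0 };
    unsigned int outer, inner;

    outer = model_write_begin(&s);  /* 0 -> 1 */
    inner = model_write_begin(&s);  /* already odd, stays 1 */
    *seq_inside = s.sequence;
    model_write_end(&s, inner);     /* nested end adds 0, stays 1 */
    model_write_end(&s, outer);     /* 1 -> 2 */

    return s.sequence;
}
```

With this scheme the sequence is odd for the whole outer section and
even once it completes, regardless of the nesting inside it.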

Thanks,
Pete LaDow