Re: Kernel crash after using new Intel NIC (igb)

From: Arun Sharma
Date: Thu May 26 2011 - 17:47:54 EST


On 5/26/11 12:47 PM, Eric Dumazet wrote:

You dont get the problem. Problem is : We can do the empty() test only
if protected by the lock.

If not locked, result can be wrong. [ false positive or negative ]



Agreed. Failing to unlink from unused list when we should have sounds wrong.

The list modification under unused_peers.lock looks generally safe. But
the control flow (based on refcnt) done outside the lock might have races.


"might" is not a good word when dealing with this ;)

Potential race in the current code:

initial refcnt = 1

T1: T2

atomic_dec_and_lock(refcnt)
// refcnt == 0

atomic_add_unless(refcnt)
unlink_from_unused()

list_add_tail(unused)
// T2 using "unused" entry


Did you test my fix ?

I could try it on one or two machines - but it won't tell us anything for weeks if not months. Unfortunately my next window to try a new kernel on a large enough sample is several months away.


Its doing the right thing : Using refcnt as the only marker to say if
the item must be removed from unused list (and lock the central lock
protecting this list only when needed)

Since we already must do an atomic operation on refcnt, using
atomic_inc_return [ or similar full barrier op ] is enough to tell us
the truth.

Yeah - using the refcnt seems better than list_empty(), but I'm not sure that your patch addresses the race above.

-Arun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/