Re: Kernel crash after using new Intel NIC (igb)

From: Eric Dumazet
Date: Thu May 26 2011 - 18:01:42 EST


Le jeudi 26 mai 2011 Ã 14:48 -0700, Arun Sharma a Ãcrit :
> On 5/26/11 12:47 PM, Eric Dumazet wrote:
>
> > You dont get the problem. Problem is : We can do the empty() test only
> > if protected by the lock.
> >
> > If not locked, result can be wrong. [ false positive or negative ]
> >
>
>
> Agreed. Failing to unlink from unused list when we should have sounds wrong.
>
> >> The list modification under unused_peers.lock looks generally safe. But
> >> the control flow (based on refcnt) done outside the lock might have races.
> >>
> >
> > "might" is not a good word when dealing with this ;)
>
> Potential race in the current code:
>
> initial refcnt = 1
>
> T1: T2
>
> atomic_dec_and_lock(refcnt)
> // refcnt == 0
>
> atomic_add_unless(refcnt)
> unlink_from_unused()
>
> list_add_tail(unused)
> // T2 using "unused" entry
>
>
> > Did you test my fix ?
>
> I could try it on one or two machines - but it won't tell us anything
> for weeks if not months. Unfortunately my next window to try a new
> kernel on a large enough sample is several months away.
>
> >
> > Its doing the right thing : Using refcnt as the only marker to say if
> > the item must be removed from unused list (and lock the central lock
> > protecting this list only when needed)
> >
> > Since we already must do an atomic operation on refcnt, using
> > atomic_inc_return [ or similar full barrier op ] is enough to tell us
> > the truth.
>
> Yeah - using the refcnt seems better than list_empty(), but I'm not sure
> that your patch addresses the race above.

It does.

It becomes

T1: T2
>
> atomic_dec_and_lock(refcnt)
> // refcnt == 0
>
> newref = atomic_add_unless_and_return(refcnt)
>
> list_add_tail(unused)
unlock();
>
> if (newref == 1) {
lock()
unlink_from_unused()
unlock()
}



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/