Re: Kernel crash after using new Intel NIC (igb)

From: Arun Sharma
Date: Thu May 26 2011 - 15:29:50 EST


On 5/24/11 11:35 PM, Eric Dumazet wrote:

>> Another possibility is to do the list_empty() check twice. Once without
>> taking the lock and again with the spinlock held.
>
> Why ?


Part of the problem is that I don't have a precise understanding of the race condition that's causing the list to become corrupted.

All I know is that doing the check under the lock fixes it. If taking the lock slows things down, we can do a cheap check outside the lock first and, because that answer may be stale, verify it again under the lock before acting on it.
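
Roughly what I have in mind, as a userspace sketch rather than a patch. Everything here is an assumption for illustration: the sk_* helpers stand in for the kernel's list primitives, the atomic_flag spinlock stands in for unused_peers.lock, and sk_unlink_from_unused() is a hypothetical name, not the actual inetpeer code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct list_head (illustration only). */
struct sk_list { struct sk_list *next, *prev; };

static void sk_list_init(struct sk_list *h) { h->next = h; h->prev = h; }
static bool sk_list_empty(const struct sk_list *h) { return h->next == h; }

static void sk_list_add_tail(struct sk_list *n, struct sk_list *head)
{
    assert(sk_list_empty(n));           /* entry must not already be linked */
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink and re-initialise, so a second call on the same entry is harmless. */
static void sk_list_del_init(struct sk_list *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    sk_list_init(e);
}

/* Toy spinlock standing in for unused_peers.lock. */
static atomic_flag sk_unused_lock = ATOMIC_FLAG_INIT;

static void sk_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&sk_unused_lock,
                                             memory_order_acquire))
        ;                               /* spin until the flag is released */
}

static void sk_unlock(void)
{
    atomic_flag_clear_explicit(&sk_unused_lock, memory_order_release);
}

struct sk_peer { struct sk_list unused; };

/* Returns true if this call removed the peer from the unused list. */
static bool sk_unlink_from_unused(struct sk_peer *p)
{
    bool removed = false;

    /* Cheap hint without the lock; it may be stale. Real kernel code would
     * need a race-tolerant read here, a plain load is only for the sketch. */
    if (sk_list_empty(&p->unused))
        return false;

    sk_lock();
    if (!sk_list_empty(&p->unused)) {   /* authoritative check, lock held */
        sk_list_del_init(&p->unused);
        removed = true;
    }
    sk_unlock();
    return removed;
}
```

Note that only a "not empty" answer gets re-verified. If the unlocked check says "empty" while another thread is about to add the node, this sketch returns without ever taking the lock, so it is only safe if that outcome is acceptable to the caller.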

> list_del_init(&p->unused); (done under lock of course) is safe, you can
> call it twice, no problem.

Doing it twice is not a problem. But doing it when we shouldn't be doing it could be the problem.

The list modification under unused_peers.lock looks generally safe. But the control flow (based on refcnt) done outside the lock might have races.

E.g., inet_putpeer() might see the refcnt drop to zero, but before it adds the node to the unused list, another thread may call inet_getpeer() and set the refcnt back to 1. We then end up with a node that's potentially in use, yet sits on the unused list.
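
To make the suspected interleaving concrete, here is a single-threaded replay of it in userspace C. This is a sketch under assumptions, not the inetpeer code: struct sk_node, the on_unused_list flag (standing in for !list_empty(&p->unused)) and all function names are hypothetical, and the "rechecked" variant is just one possible mitigation, which would only help if lookups took the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sk_node {
    atomic_int refcnt;
    bool on_unused_list;    /* stands in for !list_empty(&p->unused) */
};

/* Put, step 1: drop a reference and report whether the count hit zero. */
static bool sk_put_drop_ref(struct sk_node *p)
{
    return atomic_fetch_sub(&p->refcnt, 1) == 1;
}

/* Put, step 2 (suspected behaviour): queue the node without looking again. */
static void sk_put_queue_unchecked(struct sk_node *p)
{
    p->on_unused_list = true;
}

/* Put, step 2 (possible mitigation): re-check refcnt before queueing.
 * In real code this re-check would have to happen under the list lock. */
static void sk_put_queue_rechecked(struct sk_node *p)
{
    if (atomic_load(&p->refcnt) == 0)
        p->on_unused_list = true;
}

/* A concurrent lookup that takes a new reference on the node. */
static void sk_get_ref(struct sk_node *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Replays: A drops the last ref, B takes a ref, A queues the node.
 * Returns true if the node ends up referenced AND on the unused list. */
static bool sk_replay_race(bool recheck)
{
    struct sk_node p;
    bool hit_zero;

    atomic_init(&p.refcnt, 1);
    p.on_unused_list = false;

    hit_zero = sk_put_drop_ref(&p);     /* thread A: refcnt 1 -> 0 */
    assert(hit_zero);
    sk_get_ref(&p);                     /* thread B: refcnt 0 -> 1 */

    if (recheck)                        /* thread A resumes */
        sk_put_queue_rechecked(&p);
    else
        sk_put_queue_unchecked(&p);

    return atomic_load(&p.refcnt) > 0 && p.on_unused_list;
}
```

With recheck set to false the replay ends with a referenced node on the unused list, which is the state I'm worried about. With recheck set to true the node is not queued in this particular ordering.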

-Arun