Re: [BUG] cdev_put() race condition

From: Boaz Harrosh
Date: Wed Dec 17 2008 - 11:09:52 EST

Next message: Tino Keitel: "Re: Very slow header cache in mutt if the maildir is on ext3"
Previous message: Kumar Gala: "Please pull from 'merge' for 2.6.28"
In reply to: Hans Verkuil: "Re: [BUG] cdev_put() race condition"
Next in thread: Hans Verkuil: "Re: [BUG] cdev_put() race condition"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hans Verkuil wrote:
> On Wednesday 17 December 2008 15:52:38 Boaz Harrosh wrote:
>> Hans Verkuil wrote:
>>> On Wednesday 17 December 2008 00:30:39 Greg KH wrote:
>>>> On Wed, Dec 17, 2008 at 12:23:41AM +0100, Hans Verkuil wrote:
>>>>> On Tuesday 16 December 2008 22:21:50 Greg KH wrote:
>>>>>> On Tue, Dec 16, 2008 at 10:00:51PM +0100, Hans Verkuil wrote:
>>>>>>> On Tuesday 16 December 2008 21:22:48 Greg KH wrote:
>>>>>>>> On Mon, Dec 08, 2008 at 09:56:26PM +0100, Hans Verkuil wrote:
>>>>>>>>> Hi Greg,
>>>>>>>>>
>>>>>>>>> Laurent found a race condition in the uvc driver that we traced
>>>>>>>>> to the way chrdev_open and cdev_put/get work.
>>>>>>>>>
>>>>>>>>> You need the following ingredients to reproduce it:
>>>>>>>>>
>>>>>>>>> 1) a hot-pluggable char device like an USB webcam.
>>>>>>>>> 2) a manually created device node for such a webcam instead of
>>>>>>>>> relying on udev.
>>>>>>>>>
>>>>>>>>> In order to easily force this situation you would also need to
>>>>>>>>> add a delay to the char device's release() function. For
>>>>>>>>> webcams that would be at the top of v4l2_chardev_release() in
>>>>>>>>> drivers/media/video/v4l2-dev.c. But adding a delay to e.g.
>>>>>>>>> cdev_purge would have the same effect.
>>>>>>>>>
>>>>>>>>> The sequence of events in the case of a webcam is as follows:
>>>>>>>>>
>>>>>>>>> 1) The USB device is removed, causing a disconnect.
>>>>>>>>>
>>>>>>>>> 2) The webcam driver unregisters the video device which in turn
>>>>>>>>> calls cdev_del().
>>>>>>>>>
>>>>>>>>> 3) When the last application using the device is closed, the
>>>>>>>>> cdev is released when the kref of the cdev's kobject goes to 0.
>>>>>>>>>
>>>>>>>>> 4) If the kref's release() call takes a while due to e.g. extra
>>>>>>>>> cleanup in the case of a webcam, then another application can
>>>>>>>>> try to open the video device. Note that this requires a device
>>>>>>>>> node created with mknod, otherwise the device nodes would
>>>>>>>>> already have been removed by udev.
>>>>>>>>>
>>>>>>>>> 5) chrdev_open checks inode->i_cdev. If this is NULL (i.e. this
>>>>>>>>> device node was never accessed before), then all is fine since
>>>>>>>>> kobj_lookup will fail because cdev_del() has been called
>>>>>>>>> earlier. However, if this device node was used earlier, then
>>>>>>>>> the else part is called: cdev_get(p). This 'p' is the cdev that
>>>>>>>>> is being released. Since the kref count is 0 you will get a
>>>>>>>>> WARN message from kref_get, but the code continues on, the
>>>>>>>>> f_op->open will (hopefully) return more-or-less gracefully with
>>>>>>>>> an error and the cdev_put at the end will cause the refcount to
>>>>>>>>> go to 0 again, which results in a SECOND call to the kref's
>>>>>>>>> release function!
>>>>>>>>>
>>>>>>>>> See this link for the original discussion on the v4l list
>>>>>>>>> containing stack traces an a patch that you need if you want to
>>>>>>>>> (and can) test this with the uvc driver:
>>>>>>>>>
>>>>>>>>> http://www.spinics.net/lists/vfl/msg39967.html
>>>>>>>> The second sentence in that message shows your problem here:
>>>>>>>> To avoid the need of a reference count in every v4l2 driver,
>>>>>>>> v4l2 moved to cdev which includes its own reference counting
>>>>>>>> infrastructure based on kobject.
>>>>>>>>
>>>>>>>> cdev is not ment to handle the reference counting of any object
>>>>>>>> outside of itself, and should never be embedded within anything.
>>>>>>>> I've been thinking of taking the real "kobject" out of that
>>>>>>>> structure for a long time now, incase someone did something
>>>>>>>> foolish like this.
>>>>>>>>
>>>>>>>> Seems I was too late :(
>>>>>>>>
>>>>>>>> So, to solve this, just remove the reliance on struct cdev in
>>>>>>>> your own structures, you don't want to do this for the very
>>>>>>>> reason you have now found (and for others, like the fact that
>>>>>>>> this isn't a "real" struct kobject in play here, just a fake
>>>>>>>> one.)
>>>>>>>>
>>>>>>>> Ick, what a mess.
>>>>>>> Sorry, but this makes no sense. First of all the race condition
>>>>>>> exists regardless of how v4l uses it. Other drivers using cdev with
>>>>>>> a hot-pluggable device in combination with a manually created
>>>>>>> device node should show the same problem.
>>>>>> I don't see how this would be, unless this problem has always
>>>>>> existed in the cdev code, it was created back before dynamic device
>>>>>> nodes with udev existed :)
>>>>>>
>>>>>>> It's just that we found it with v4l because the release callback
>>>>>>> takes longer than usual, thus increasing the chances of hitting the
>>>>>>> race.
>>>>>> The release callback for the cdev itself?
>>>>> Yes, although we override the release kobj_type.
>>>> Hahaha, yeah, I'm really afraid to say it, but you deserve the oops if
>>>> you do that :)
>>>>
>>>>> I noticed that a patch to do this through a cdev function was
>>>>> recently posted (and possibly already merged).
>>>> No, it was rejected as it was not needed.
>>>>
>>>>>>> The core problem is simply that it is possible to call cdev_get
>>>>>>> while in cdev_put! That should never happen.
>>>>>> True, and cdev_put is only called from __fput() which has the proper
>>>>>> locking to handle the call to cdev_get() if a char device is opened,
>>>>>> right?
>>>>> cdev_put is also called through cdev_del, and that's where the
>>>>> problem resides. The cdev_del() call has no locking to prevent a call
>>>>> from cdev_get(), and it can be called asynchronously as well in
>>>>> response to a USB disconnect.
>>>>>
>>>>>>> Secondly, why shouldn't struct cdev be embedded in anything? It's
>>>>>>> used in lots of drivers that way. I really don't see what's unusual
>>>>>>> or messy about v4l in that respect.
>>>>>> I don't see it embedded in any other structures at the moment, and
>>>>>> used to reference count the structure, do you have pointers to where
>>>>>> it is?
>>>>> Hmm, I took a closer look and it looks like v4l is indeed the first
>>>>> to use cdev for refcounting. Others either don't need it or do their
>>>>> own refcounting. It's definitely embedded in several structs, though.
>>>>> (grep for cdev_init)
>>>> Yes, embedding it is fine, just don't use it for reference counting as
>>>> it is using it's own reference counting for something other than what
>>>> you want to use it for.
>>>>
>>>> Overriding the release function should have given you the hint that
>>>> this is not the structure that you should be using for this kind of
>>>> operation.
>>>>
>>>>> However, I don't see how any amount of refcounting in drivers can
>>>>> prevent this race. At some point cdev_del() is called and (if nobody
>>>>> is using the chardev) cdev's kref goes to 0 triggering the release,
>>>>> and at that moment chrdev_open() can be called and the driver has no
>>>>> chance to prevent that, only cdev can do that.
>>>> Again, don't use cdev's reference counting for your own object
>>>> lifecycle, it is different and will cause problems, like you have
>>>> found out.
>>> Sigh. It has nothing to do with how v4l uses it. And to demonstrate
>>> this, here is how you reproduce it with the sg module (tested it with
>>> my USB harddisk).
>>>
>>> 1) apply this patch to char_dev.c:
>>>
>>> --- char_dev.c.orig 2008-12-17 13:30:22.000000000 +0100
>>> +++ char_dev.c 2008-12-17 08:50:23.000000000 +0100
>>> @@ -412,6 +412,9 @@
>>>
>>> static void cdev_purge(struct cdev *cdev)
>>> {
>>> + printk("delayin\n");
>>> + mdelay(2000);
>>> + printk("delayout\n");
>>> spin_lock(&cdev_lock);
>>> while (!list_empty(&cdev->list)) {
>>> struct inode *inode;
>>>
>>>
>>> 2) connect your harddisk and make note of the sgX device that's
>>> created. 3) use mknod to create an sgX device manually for that
>>> harddisk. 4) open and close it (cat ./sgX >/dev/null followed by ctrl-C
>>> is sufficient) 5) disconnect the harddisk
>>> 6) when 'delayin' appears in the log, run cat ./sgX >/dev/null again
>>> 7) sit back and enjoy this lovely race condition:
>>>
>>> usb-storage: device found at 6
>>> usb-storage: waiting for device to settle before scanning
>>> scsi 6:0:0:0: Direct-Access WD 5000BEV External 1.05 PQ: 0
>>> ANSI: 4 sd 6:0:0:0: [sdg] 976773168 512-byte hardware sectors (500108
>>> MB) sd 6:0:0:0: [sdg] Write Protect is off
>>> sd 6:0:0:0: [sdg] Mode Sense: 21 00 00 00
>>> sd 6:0:0:0: [sdg] Assuming drive cache: write through
>>> sd 6:0:0:0: [sdg] 976773168 512-byte hardware sectors (500108 MB)
>>> sd 6:0:0:0: [sdg] Write Protect is off
>>> sd 6:0:0:0: [sdg] Mode Sense: 21 00 00 00
>>> sd 6:0:0:0: [sdg] Assuming drive cache: write through
>>> sdg: sdg1
>>> sd 6:0:0:0: [sdg] Attached SCSI disk
>>> sd 6:0:0:0: Attached scsi generic sg8 type 0
>>> usb-storage: device scan complete
>>> usb 2-4.1: USB disconnect, address 6
>>> delayin
>>> ------------[ cut here ]------------
>>> WARNING: at lib/kref.c:43 kref_get+0x17/0x1c()
>>> Modules linked in: sg
>>> Pid: 1491, comm: cat Not tainted 2.6.26-durdane-test #3
>>> [<c0134ad4>] warn_on_slowpath+0x40/0x79
>>> [<c0160ff8>] __do_fault+0x2e4/0x334
>>> [<c017e0c7>] __d_lookup+0xce/0x115
>>> [<c0175393>] do_lookup+0x53/0x145
>>> [<c017d523>] dput+0x16/0xdc
>>> [<c0176d53>] __link_path_walk+0xb36/0xc30
>>> [<c0181d44>] mntput_no_expire+0x18/0xef
>>> [<c02225fb>] kref_get+0x17/0x1c
>>> [<c0221be2>] kobject_get+0xf/0x13
>>> [<c01710b1>] cdev_get+0x55/0x69
>>> [<c01712ce>] chrdev_open+0xc2/0x184
>>> [<c017120c>] chrdev_open+0x0/0x184
>>> [<c016dd3b>] __dentry_open+0x127/0x210
>>> [<c016dea1>] nameidata_to_filp+0x1c/0x2c
>>> [<c01780c4>] do_filp_open+0x33d/0x6d5
>>> [<c016daee>] get_unused_fd_flags+0xb8/0xc2
>>> [<c016db35>] do_sys_open+0x3d/0xb6
>>> [<c016dbf2>] sys_open+0x1e/0x23
>>> [<c01188a9>] sysenter_past_esp+0x6a/0x91
>>> =======================
>>> ---[ end trace c8225d49e3c03b85 ]---
>>> delayin
>>> delayout
>>> delayout

OK Now I do not understand something. How is the double release possible?
I guess it is something to do with the complicated chrdev_open() in its
inode->i_cdev == NULL case. But still isn't the kref inside kobject
suppose to protect me from exactly that?

>>>
>>> Note the duplicate 'delayin' messages. Also note that the cdev struct
>>> was allocated by sg.c, so the second cdev cleanup will likely poke into
>>> already freed memory.
>>>
>>> I tried to reproduce it as well without the patched char_dev.c by using
>>> this little program:
>>>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <fcntl.h>
>>>
>>> main()
>>> {
>>> int state = 0;
>>>
>>> while (1) {
>>> int fh = open("./sg8", O_RDONLY);
>>> if (fh >= 0) {
>>> if (state == 0) {
>>> printf("opened\n");
>>> state = 1;
>>> }
>>> close(fh);
>>> }
>>> else {
>>> if (state) {
>>> printf("error\n");
>>> state = 0;
>>> }
>>> }
>>> }
>>> }
>>>
>>> Just run it and connect/disconnect the USB harddisk.
>>>
>>> Unfortunately, I wasn't able to reproduce it because I always hit a bug
>>> in khelper first after only a few attempts:
>>>
>>> sd 13:0:0:0: [sdg] Assuming drive cache: write through
>>> sdg: sdg1
>>> sd 13:0:0:0: [sdg] Attached SCSI disk
>>> sd 13:0:0:0: Attached scsi generic sg8 type 0
>>> usb-storage: device scan complete
>>> usb 2-4.1: USB disconnect, address 13
>>> VFS: Close: file count is 0
>>> BUG: unable to handle kernel NULL pointer dereference at 00000004
>>> IP: [<c0409df2>] _spin_lock_irqsave+0x13/0x25
>>> *pdpt = 000000002ebb5001 *pde = 0000000000000000
>>> Oops: 0002 [#1] PREEMPT SMP
>>> Modules linked in: nvidia(P)
>>>
>>> Pid: 13662, comm: khelper Tainted: P (2.6.26-durdane #1)
>>> EIP: 0060:[<c0409df2>] EFLAGS: 00010002 CPU: 1
>>> EIP is at _spin_lock_irqsave+0x13/0x25
>>> EAX: 00000004 EBX: 00000000 ECX: 00000213 EDX: 00000100
>>> ESI: 00000004 EDI: 00000000 EBP: 00000000 ESP: eeabffc8
>>> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>>> Process khelper (pid: 13662, ti=eeabe000 task=eeb280c0
>>> task.ti=eeabe000) Stack: c012e788 ee8fb300 00000000 00000000 c014144c
>>> 00000000 c01413ec c01194ab
>>> ee8fb300 00000000 00000000 00000000 00000000 00000000
>>> Call Trace:
>>> [<c012e788>] complete+0xf/0x36
>>> [<c014144c>] wait_for_helper+0x60/0x65
>>> [<c01413ec>] wait_for_helper+0x0/0x65
>>> [<c01194ab>] kernel_thread_helper+0x7/0x10
>>> =======================
>>> Code: 90 89 e2 81 e2 00 e0 ff ff ff 42 14 f0 83 28 01 79 05 e8 ae ff ff
>>> ff c3 9c 59 fa 89 e2 81 e2 00 e0 ff ff ff 42 14 ba 00 01 00 00 <f0> 66
>>> 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 c3 fa 89 e2
>>> EIP: [<c0409df2>] _spin_lock_irqsave+0x13/0x25 SS:ESP 0068:eeabffc8
>>> ---[ end trace f5d38a13beb6dc77 ]---
>>> note: khelper[13662] exited with preempt_count 1
>>>
>>> I tried the same on a 2.6.28-rc8 kernel and got the same khelper error.
>>> Lovely.
>>>
>>> So, my conclusion remains that cdev has a race condition due to
>>> insufficient locking between cdev_del/put and cdev_get. And that
>>> khelper has it's own problems as well :-(
>>>
>>> Regards,
>>>
>>> Hans
>> Hi Hans
>>
>> Just as a far shot could you try this patch:
>>
>> [PATCH RESEND] char_dev: add cdev->release() and convert cdev_alloc() to
>> use it http://www.spinics.net/lists/kernel/msg808783.html
>>
>> Your goals look related. Also see the lkml thread.
>>
>> Please cc me on the matter.
>
> Hi Boaz,
>
> It will not fix this particular race condition, but it would make it easier
> to hook the v4l release callback into cdev. It's a bit of a hack at the
> moment. Greg clearly doesn't like drivers doing things like this, but it's
> still not clear to me why not. As far as I can tell it is the only reliable
> place where you can be 100% certain that it is safe to do the cleanup since
> no one else is or can use the char device ever again.
>
> Regards,
>
> Hans
>

I had a similar problem and solved it in a way that I think should be safe.
I'll try and run your tests to make sure. Here is a short description of my
solution:

MyDevice {
...
embed_cdev;
embed_kref;
...
};

OnHotPlug() {
...
kref_init(embed_kref); // kref==1
cdev_add(embed_cdev); // cdev==1
get_(embed_cdev); // cdev++==2
}

OnFileOpen() { // kernel does // cdev++==n > 2
kref_get(embed_kref); // kref++==n > 1
}

OnFileClose() {
kref_put(embed_kref, __release); // kref++==n >= 0
} // kernel does // cdev--==n >= 1

OnHotRemove() {
cdev_del(embed_cdev); // cdev--==1
kref_put(embed_kref, __release); // kref--==n >= 0
}

__release() { // by definition kref==0
put_(embed_cdev); // cdev--==0
}

What do you think?

BTW sg does not use kref and I suspect it might be racy by its own.

Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Tino Keitel: "Re: Very slow header cache in mutt if the maildir is on ext3"
Previous message: Kumar Gala: "Please pull from 'merge' for 2.6.28"
In reply to: Hans Verkuil: "Re: [BUG] cdev_put() race condition"
Next in thread: Hans Verkuil: "Re: [BUG] cdev_put() race condition"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]