Re: [PATCH] io-controller: Fix task hanging when there are more than one groups

From: Gui Jianfeng
Date: Tue Sep 15 2009 - 20:07:49 EST


Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 09:15:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Sep 09, 2009 at 03:38:25PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> On Mon, Sep 07, 2009 at 03:40:53PM +0800, Gui Jianfeng wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I happened to encounter a bug when I tested IO Controller V9.
>>>>>> When three tasks run concurrently in three groups, that is, one
>>>>>> in the parent group and the other two in two different child
>>>>>> groups, each reading or writing files on some disk, say "hdb",
>>>>>> a task may hang, and other tasks that access "hdb" will also
>>>>>> hang.
>>>>>>
>>>>>> The bug only happens when using the AS io scheduler.
>>>>>> The following script can reproduce this bug on my box.
>>>>>>
>>>>> Hi Gui,
>>>>>
>>>>> I tried reproducing this on my system and can't reproduce it. All
>>>>> three processes get killed and the system does not hang.
>>>>>
>>>>> Can you please dig a bit deeper into it?
>>>>>
>>>>> - Does the whole system hang, or does just IO to the disk seem to be hung?
>>>> The task hangs only when it tries to do IO to the disk.
>>>>
>>>>> - Does switching the io scheduler on the device work?
>>>> Yes, the io scheduler can be switched, and the hung task then resumes.
>>>>
>>>>> - If the system is not hung, can you capture the blktrace on the device.
>>>>> Trace might give some idea, what's happening.
>>>> I ran a "find" task to do some IO on that disk. The task seems to hang
>>>> while it is issuing the getdents() syscall.
>>>> The kernel generates the following message:
>>>>
>>>> INFO: task find:3260 blocked for more than 120 seconds.
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> find D a1e95787 1912 3260 2897 0x00000004
>>>> f6af2db8 00000096 f660075c a1e95787 00000032 f6600270 f6600508 c2037820
>>>> 00000000 c09e0820 f655f0c0 f6af2d8c fffebbf1 00000000 c0447323 f7152a1c
>>>> 0006a144 f7152a1c 0006a144 f6af2e04 f6af2db0 c04438df c2037820 c2037820
>>>> Call Trace:
>>>> [<c0447323>] ? getnstimeofday+0x57/0xe0
>>>> [<c04438df>] ? ktime_get_ts+0x4a/0x4e
>>>> [<c068ab68>] io_schedule+0x47/0x79
>>>> [<c04c12ee>] sync_buffer+0x36/0x3a
>>>> [<c068ae14>] __wait_on_bit+0x36/0x5d
>>>> [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>> [<c068ae93>] out_of_line_wait_on_bit+0x58/0x60
>>>> [<c04c12b8>] ? sync_buffer+0x0/0x3a
>>>> [<c0440fa4>] ? wake_bit_function+0x0/0x43
>>>> [<c04c1249>] __wait_on_buffer+0x19/0x1c
>>>> [<f81e4186>] ext3_bread+0x5e/0x79 [ext3]
>>>> [<f81e77a8>] htree_dirblock_to_tree+0x1f/0x120 [ext3]
>>>> [<f81e7923>] ext3_htree_fill_tree+0x7a/0x1bb [ext3]
>>>> [<c04a01f9>] ? kmem_cache_alloc+0x86/0xf3
>>>> [<c044c428>] ? trace_hardirqs_on_caller+0x107/0x12f
>>>> [<c044c45b>] ? trace_hardirqs_on+0xb/0xd
>>>> [<f81e09e4>] ? ext3_readdir+0x9e/0x692 [ext3]
>>>> [<f81e0b34>] ext3_readdir+0x1ee/0x692 [ext3]
>>>> [<c04b1100>] ? filldir64+0x0/0xcd
>>>> [<c068b86a>] ? mutex_lock_killable_nested+0x2b1/0x2c5
>>>> [<c068b874>] ? mutex_lock_killable_nested+0x2bb/0x2c5
>>>> [<c04b12db>] ? vfs_readdir+0x46/0x94
>>>> [<c04b12fd>] vfs_readdir+0x68/0x94
>>>> [<c04b1100>] ? filldir64+0x0/0xcd
>>>> [<c04b1387>] sys_getdents64+0x5e/0x9f
>>>> [<c04028b4>] sysenter_do_call+0x12/0x32
>>>> 1 lock held by find/3260:
>>>> #0: (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<c04b12db>] vfs_readdir+0x46/0x94
>>>>
>>>> ext3 calls wait_on_buffer() to wait for the buffer, and the task is scheduled out in
>>>> TASK_UNINTERRUPTIBLE state. I found that this task resumes after quite a long period (more than 10 minutes).
>>> Thanks Gui. As Jens said, it does look like a case of missing queue
>>> restart somewhere and now we are stuck, no requests are being dispatched
>>> to the disk and queue is already unplugged.
>>>
>>> Can you please also try capturing the trace of events at the io scheduler
>>> (blktrace) to see how we got into that situation?
>>>
>>> Are you using ide drivers and not libata? As Jens said, I will try to make
>>> use of ide drivers and see if I can reproduce it.
>>>
>> Hi Vivek, Jens,
>>
>> Currently, if only the root cgroup exists and there are no child cgroups, io-controller
>> optimizes by not expiring the current ioq, on the assumption that the current ioq belongs to the
>> root group. In some cases this assumption does not hold. Consider the following scenario: a child
>> cgroup exists under the root cgroup, task A runs in the child cgroup, and task A issues some IOs.
>> Then we kill task A and remove the child cgroup, so that only the root cgroup remains. But the ioq
>> is still under service, and from now on it will not expire because of the "only root" optimization.
>> The following patch ensures the ioq does belong to the root group when only the root group exists.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@xxxxxxxxxxxxxx>
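For illustration, the condition described above can be modelled in a few lines of standalone C. This is only a sketch of the idea, not the patch itself: the struct layouts and the names nr_groups, active_ioq, may_skip_expiry_old() and may_skip_expiry_fixed() are hypothetical and do not come from the io-controller code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the io-controller structures. */
struct io_group { int id; };
struct io_queue { struct io_group *group; };

struct elv_fq_data {
	struct io_group *root_group;
	int nr_groups;               /* groups that currently exist, root included */
	struct io_queue *active_ioq; /* queue currently under service, may be NULL */
};

/* Old behaviour as described: only the number of groups is checked. */
bool may_skip_expiry_old(const struct elv_fq_data *efqd)
{
	assert(efqd != NULL);
	return efqd->nr_groups == 1;
}

/*
 * Behaviour after the fix as described: expiry is skipped only if the
 * queue under service really belongs to the root group.
 */
bool may_skip_expiry_fixed(const struct elv_fq_data *efqd)
{
	assert(efqd != NULL);
	return efqd->nr_groups == 1 &&
	       efqd->active_ioq != NULL &&
	       efqd->active_ioq->group == efqd->root_group;
}
```

With a queue left over from a removed child group, the old check would still skip expiry, while the stricter check lets the queue expire.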
>
> Hi Gui,
>
> I have modified your patch a bit to improve readability. Looking at the
> issue closely, I realized that this optimization of not expiring the
> queue can lead to other issues, like high vdisktime in certain scenarios.
> While fixing that, I also noticed a high rate of AS queue expiration
> in certain cases, which could have been avoided.
>
> Here is a patch which should fix all that. I am still testing this patch
> to make sure that nothing is obviously broken. Will merge it if
> there are no issues.
>
> Thanks
> Vivek
>
> o Fixed the issue of not expiring the queue for single ioq schedulers. Reported
> and fixed by Gui.
>
> o If an AS queue is not expired for a long time and suddenly somebody
> decides to create a group and launch a job there, the old AS queue
> will be expired with a very high value of slice used and will get
> a very high disk time. Fix it by marking the queue as "charge_one_slice"
> and charging the queue only for a single time slice, not for the whole
> duration for which the queue was running.
>
> o In the case of AS, the elevator fair queuing layer can expire queues
> excessively, for a few reasons.
> - AS does not anticipate on a queue if there are no competing requests.
> So if only a single reader is present in a group, anticipation does
> not get turned on.
>
> - The elevator layer does not know that AS is anticipating, and hence
> initiates expiry requests in select_ioq(), thinking the queue is empty.
>
> - The elevator layer tries to aggressively expire the last empty queue.
> This can lead to a lot of queue expiry.
>
> o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in the
> queue has completed and the associated io context is eligible to anticipate. Also,
> AS lets the elevator layer know that it is anticipating (elv_ioq_wait_request()).
> This solves the above-mentioned issues.
>
> o Moved some of the code in separate functions to improve readability.
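The "charge_one_slice" item in the changelog above can be sketched as a small standalone C model. The field names slice_start and slice_len and the function ioq_slice_used() are hypothetical. Capping the charge at one slice (rather than charging exactly one slice) is an assumption, since the thread does not say which of the two the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a queue's accounting state. */
struct io_queue {
	unsigned long slice_start; /* time at which the queue became active */
	unsigned long slice_len;   /* nominal length of one time slice */
	bool charge_one_slice;     /* set when the queue ran without expiry */
};

/*
 * Return the amount of disk time to charge when the queue expires at
 * time "now". A queue marked charge_one_slice is charged at most one
 * slice, however long it actually ran.
 */
unsigned long ioq_slice_used(const struct io_queue *ioq, unsigned long now)
{
	unsigned long elapsed;

	assert(ioq != NULL);
	assert(now >= ioq->slice_start);

	elapsed = now - ioq->slice_start;
	if (ioq->charge_one_slice && elapsed > ioq->slice_len)
		return ioq->slice_len;
	return elapsed;
}
```

Without the flag, a queue that ran unexpired for a long time would be charged the whole elapsed time at once, which is the high disk time the changelog describes.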
>
> Signed-off-by: Vivek Goyal <vgoyal@xxxxxxxxxx>
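The interaction between anticipation and expiry in select_ioq() described in the changelog can also be modelled in standalone C. This is a sketch under stated assumptions: the struct, the wait_request flag and should_expire_in_select() are hypothetical names, and the flag merely stands in for whatever state elv_ioq_wait_request() reports in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a queue as seen by the elevator layer. */
struct ioq_state {
	int nr_queued;     /* requests still queued in this ioq */
	bool wait_request; /* set while the io scheduler is anticipating */
};

/*
 * Decide whether the elevator layer should expire the queue when it
 * picks the next queue to serve. An empty queue is kept if the io
 * scheduler has said that it is waiting for the next request.
 */
bool should_expire_in_select(const struct ioq_state *ioq)
{
	assert(ioq != NULL);

	if (ioq->nr_queued > 0)
		return false; /* still has work to dispatch */
	if (ioq->wait_request)
		return false; /* scheduler is anticipating, keep the queue */
	return true;          /* empty and nobody is waiting on it */
}
```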

I'd like to give this patch a try :)

--
Regards
Gui Jianfeng
