Re: [PATCH 1/2] acpi : cpu hot-remove returns error number when cpu_down()fails

From: Yasuaki Ishimatsu
Date: Tue Jul 10 2012 - 04:24:03 EST


Hi Srivatsa,

2012/07/10 16:57, Srivatsa S. Bhat wrote:
On 07/10/2012 10:27 AM, Yasuaki Ishimatsu wrote:
Hi Toshi,

2012/07/10 6:15, Toshi Kani wrote:
On Mon, 2012-07-09 at 16:55 +0530, Srivatsa S. Bhat wrote:
On 07/09/2012 08:01 AM, Yasuaki Ishimatsu wrote:
Hi Srivatsa,

Thank you for your reviewing.

2012/07/06 18:51, Srivatsa S. Bhat wrote:
On 07/06/2012 08:46 AM, Yasuaki Ishimatsu wrote:
Even if cpu_down() fails, acpi_processor_remove() continues to
remove the cpu.

Ouch!

But in this case, it should return error number since some process
may run on
the cpu. If the cpu has a running process and the cpu is turned
the power off,
the system cannot work well.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx>

---
drivers/acpi/processor_driver.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c
2012-06-25 04:53:04.000000000 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-05
21:02:58.711285382 +0900
@@ -610,7 +610,7 @@ err_free_pr:
static int acpi_processor_remove(struct acpi_device *device,
int type)
{
struct acpi_processor *pr = NULL;
-
+ int ret;

if (!device || !acpi_driver_data(device))
return -EINVAL;
@@ -621,8 +621,9 @@ static int acpi_processor_remove(struct
goto free;

if (type == ACPI_BUS_REMOVAL_EJECT) {
- if (acpi_processor_handle_eject(pr))
- return -EINVAL;
+ ret = acpi_processor_handle_eject(pr);
+ if (ret)
+ return ret;
}

acpi_processor_power_exit(pr, device);
@@ -841,12 +842,17 @@ static acpi_status acpi_processor_hotadd

static int acpi_processor_handle_eject(struct acpi_processor *pr)
{
- if (cpu_online(pr->id))
- cpu_down(pr->id);
+ int ret;
+
+ if (cpu_online(pr->id)) {
+ ret = cpu_down(pr->id);
+ if (ret)
+ return ret;
+ }


Strictly speaking, this is not thorough enough. What prevents someone
from onlining that same cpu again, at this point?
So, IMHO, you need to wrap the contents of this function inside a
get_online_cpus()/put_online_cpus() block, to prevent anyone else
from messing with CPU hotplug at the same time.

If I understand your comment by mistake, please let me know.
If the contents is wrapped a inside
get_online_cpus()/put_online_cpus() block
as below, cpu_down() will stop since cpu_down() calls
cpu_hotplug_begin() and
cpu_hotplug_begin() waits for cpu_hotplug.refcount to become 0.


You are right. Sorry, I overlooked that.

+ get_online_cpus()
+ if (cpu_online(pr->id)) {
+ ret = cpu_down(pr->id);
+ if (ret)
+ return ret;
+ }
+ put_online_cpus()

I think following patch can prevent it correctly. How about the patch?

---
drivers/acpi/processor_driver.c | 12 ++++++++++++
kernel/cpu.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)

Index: linux-3.5-rc4/drivers/acpi/processor_driver.c
===================================================================
--- linux-3.5-rc4.orig/drivers/acpi/processor_driver.c 2012-07-09
09:59:01.280211202 +0900
+++ linux-3.5-rc4/drivers/acpi/processor_driver.c 2012-07-09
11:05:34.559859236 +0900
@@ -844,14 +844,26 @@ static int acpi_processor_handle_eject(s
{
int ret;

+retry:
if (cpu_online(pr->id)) {
ret = cpu_down(pr->id);
if (ret)
return ret;
}

+ get_online_cpus();
+ /*
+ * Someone might online the cpu again at this point. So we
check that
+ * cpu has been onlined or not. If cpu is online, we try to
offline
+ * the cpu again.
+ */
+ if (cpu_online(pr->id)) {

How about this:
if (unlikely(cpu_online(pr->id)) {
since the probability of this happening is quite small...

+ put_online_cpus();
+ goto retry;
+ }
arch_unregister_cpu(pr->id);
acpi_unmap_lsapic(pr->id);
+ put_online_cpus();
return ret;
}

This retry logic doesn't look elegant, but I don't see any better
method :-(

Another possible option is to fail the request instead of retrying it.

Good idea!! I'll update it.


It would be quite challenging to allow on-lining and off-lining
operations to run concurrently. In fact, even if we close this window,
there is yet another window right after the new put_online_cpus() call.

I think if we close the window, another window does not open.
acpi_unmap_lsapic() sets cpu_present mask to false before new
put_online_cpus()
is called. So even if _cpu_up() is called, the function returns -EINAVL by
following added code.

@@ -343,11 +343,13 @@ static int __cpuinit _cpu_up(unsigned in
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct task_struct *idle;

- if (cpu_online(cpu) || !cpu_present(cpu))
- return -EINVAL;
-
cpu_hotplug_begin();

+ if (cpu_online(cpu) || !cpu_present(cpu)) {
+ ret = -EINVAL;
+ goto out;
+ }
+


Right. Yasuaki's patch will ensure that there are no more race conditions
because it does the cpu_present() check after taking the cpu_hotplug.lock.
So I think it is safe and still doable from the kernel's perspective.

But the question is, "should we do it?" I think Toshi's suggestion of failing
the hot-remove request (if we find that the cpu has been onlined again by some
other task) sounds like a good idea for another reason: cpu hotplug is not
initiated by the OS by itself; its requested by the user; so if something is
onlining the cpu back again, the user better take a second look and decide
what exactly he wants to do with that cpu - whether keep it online or
hot-remove it out.

I think so too. Failing the hot-remove request is good idea.


Trying to online as well as hot-remove the same cpu simultaneously sounds like
a crazy thing to do, and returning -EBUSY or -EAGAIN in the hot-remove case
(ie., failing that request) would give a warning to the user and a chance to
reflect upon what exactly he wants to do with the cpu.

So, IMHO, we should protect against the race condition (between cpu_up and
hot-remove) but choose to fail the hot-remove request, and add a comment saying
why we chose to fail the request, even though we could have gone ahead and
completed it.

I have already sent 2nd version of the patch. But the warning message is
not included in the patch. So I will add the warning message into 3rd
version.

Thanks,
Yasuaki Ishimatsu

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/