Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

From: Vikram Mulukutla
Date: Sat May 14 2016 - 14:28:55 EST


On 5/14/2016 8:39 AM, Thomas Gleixner wrote:
On Fri, 13 May 2016, Vikram Mulukutla wrote:
On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@
 
 static __always_inline int preempt_count(void)
 {
-        return current_thread_info()->preempt_count;
+        return READ_ONCE(current_thread_info()->preempt_count);
 }
 
-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
         return &current_thread_info()->preempt_count;
 }


Thanks Peter, this patch worked for me. The compiler no longer optimizes out
the increment/decrement of the preempt_count.

I have a hard time understanding why the compiler optimizes out stuff w/o that patch.

We already have:

#define preempt_disable() \
do { \
        preempt_count_inc(); \
        barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
        barrier(); \
        preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule(); \
} while (0)

So the barriers already forbid the compiler from reordering code. How on earth is the compiler supposed to optimize the dec/inc out?


While I cannot claim more than a very rudimentary knowledge of compilers, this was my initial reaction as well. But the barriers are in the wrong spots for the way the driver in question used these macros: preempt_disable() has the barrier() _after_ the increment, and sched_preempt_enable_no_resched() has it _before_ the decrement. So if one uses preempt_enable_no_resched() followed by preempt_disable(), there is no compiler barrier between the decrement and the increment, and nothing stops the compiler from cancelling them against each other. Whether that should translate into such a seemingly aggressive optimization is something I'm not completely sure of.
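To make that concrete, here is the problematic ordering with the macros expanded (this paraphrases the driver's pattern rather than quoting it):

        preempt_enable_no_resched();    /* barrier(); preempt_count_dec(); */
        preempt_disable();              /* preempt_count_inc(); barrier(); */

        /*
         * The compiler therefore sees:
         *
         *        barrier();
         *        preempt_count_dec();
         *        preempt_count_inc();
         *        barrier();
         *
         * with no barrier between the dec and the inc, so it may
         * legally cancel the pair.
         */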

There is more code than the preempt stuff depending on barrier() and expecting that the compiler does not optimize things out at will. Are we supposed to hunt down all occurrences and amend them with READ_ONCE or whatever, one by one?


I think the barriers work as intended for the usual sequence of preempt_disable followed by preempt_enable_no_resched; it is only the inverted sequence that loses the barrier between the count operations.
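Expanding that intended ordering shows why:

        preempt_disable();              /* preempt_count_inc(); barrier(); */
        /* ... critical section ... */
        preempt_enable_no_resched();    /* barrier(); preempt_count_dec(); */

At least one barrier() always separates the inc from the dec, so the compiler cannot merge them.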

Thanks,

tglx




As a simple experiment, I tried gcc on x86 with the following C program. This was really to convince myself rather than you and Peter!

#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
        int pc;
};

#define preempt_count() \
        ti_ptr->pc

#define preempt_disable() \
do \
{ \
        preempt_count() += 1; \
        barrier(); \
} \
while (0)

#define preempt_enable() \
do \
{ \
        barrier(); \
        preempt_count() -= 1; \
} \
while (0)

struct thread_info *ti_ptr;

int main(void)
{
        struct thread_info ti;

        /* Publish &ti through a global so gcc can't prove it never escapes. */
        ti_ptr = &ti;
        int b = 0;      /* initialize to avoid reading an indeterminate value */

        preempt_enable();
        b = b + 500;
        preempt_disable();

        printf("b = %d\n", b);

        return 0;
}
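For reference, I built and disassembled this with something along these lines (exact invocation from memory):

        gcc -O2 -o test test.c
        objdump -d test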

With gcc -O2 I get this (the ARM compiler behaves similarly):

0000000000400470 <main>:
400470: 48 83 ec 18 sub $0x18,%rsp
400474: 48 89 25 cd 0b 20 00 mov %rsp,0x200bcd(%rip)
40047b: ba f4 01 00 00 mov $0x1f4,%edx
400480: be 14 06 40 00 mov $0x400614,%esi
400485: bf 01 00 00 00 mov $0x1,%edi
40048a: 31 c0 xor %eax,%eax
40048c: e8 cf ff ff ff callq 400460 <__printf_chk@plt>
400491: 31 c0 xor %eax,%eax
400493: 48 83 c4 18 add $0x18,%rsp
400497: c3 retq

Note that the increment and decrement are gone entirely: preempt_count is never touched, and printf() is handed the folded constant 0x1f4 (500) directly.

If I swap preempt_enable and preempt_disable I get this:

0000000000400470 <main>:
400470: 48 83 ec 18 sub $0x18,%rsp
400474: 48 89 25 cd 0b 20 00 mov %rsp,0x200bcd(%rip)
40047b: 83 04 24 01 addl $0x1,(%rsp)
40047f: 48 8b 05 c2 0b 20 00 mov 0x200bc2(%rip),%rax
400486: ba f4 01 00 00 mov $0x1f4,%edx
40048b: be 14 06 40 00 mov $0x400614,%esi
400490: bf 01 00 00 00 mov $0x1,%edi
400495: 83 28 01 subl $0x1,(%rax)
400498: 31 c0 xor %eax,%eax
40049a: e8 c1 ff ff ff callq 400460 <__printf_chk@plt>
40049f: 31 c0 xor %eax,%eax
4004a1: 48 83 c4 18 add $0x18,%rsp
4004a5: c3 retq

This time the increment (the addl $0x1 at 40047b) and the decrement (the subl $0x1 at 400495) both survive, since the barrier()s now sit between them.

Note: if I place ti_ptr on the stack, the ordering/barriers no longer matter; the inc/dec is always optimized out. I suppose the compiler treats current_thread_info as coming from a different memory location rather than the current stack frame. In any case, IMHO the volatile preempt_count patch is necessary.
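Incidentally, the toy program can be fixed the same way with an ACCESS_ONCE()-style accessor (a user-space imitation of the kernel macro; applying it here is my extrapolation, not part of Peter's patch):

        #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

        #define preempt_count() \
                ACCESS_ONCE(ti_ptr->pc)

With the count turned into a volatile access, gcc should emit the addl/subl in both orderings, mirroring what the volatile preempt_count_ptr() change does in the kernel.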

Thanks,
Vikram

--