[PATCH RFC] get rid of cpupri lock

From: Chris Mason
Date: Mon Sep 13 2010 - 14:04:33 EST


I recently had the chance to try and tune 2.6.32 kernels running oracle
on OLTP workloads. One of the things oracle loves to do for tuning
these benchmarks is make all the database tasks involved SCHED_FIFO.
This is because they have their own userland locks and if they get
scheduled out, lock contention goes up.

<please insert flames about userland locking here>

<and here>

<and here>

The box I was tuning had 8 sockets and the first thing that jumped out
at me during the run was that we were spending all our system time
inside cpupri_set. Since the rq lock must be held in cpurpi_set, I
don't think we need the cpupri lock at all.

The patch below is entirely untested, mostly because I'm hoping for
hints on good ways to test it. Clearly Oracle RT isn't the workload we
really want to tune for, but I think this change is generally useful if
we can do it safely.

cpusets could also be used to mitigate this problem, but if we can just
avoid the lock it would be nice.

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 2722dc1..dd51302 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -115,7 +115,6 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
{
int *currpri = &cp->cpu_to_pri[cpu];
int oldpri = *currpri;
- unsigned long flags;

newpri = convert_prio(newpri);

@@ -134,26 +133,15 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
if (likely(newpri != CPUPRI_INVALID)) {
struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];

- raw_spin_lock_irqsave(&vec->lock, flags);
-
cpumask_set_cpu(cpu, vec->mask);
- vec->count++;
- if (vec->count == 1)
+ if (atomic_inc_return(&vec->count) == 1)
set_bit(newpri, cp->pri_active);
-
- raw_spin_unlock_irqrestore(&vec->lock, flags);
}
if (likely(oldpri != CPUPRI_INVALID)) {
struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];
-
- raw_spin_lock_irqsave(&vec->lock, flags);
-
- vec->count--;
- if (!vec->count)
+ if (atomic_dec_and_test(&vec->count))
clear_bit(oldpri, cp->pri_active);
cpumask_clear_cpu(cpu, vec->mask);
-
- raw_spin_unlock_irqrestore(&vec->lock, flags);
}

*currpri = newpri;
@@ -174,9 +162,8 @@ int cpupri_init(struct cpupri *cp)

for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+ atomic_set(&vec->count, 0);

- raw_spin_lock_init(&vec->lock);
- vec->count = 0;
if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
goto cleanup;
}
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 9fc7d38..fe07002 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -12,8 +12,7 @@
/* values 2-101 are RT priorities 0-99 */

struct cpupri_vec {
- raw_spinlock_t lock;
- int count;
+ atomic_t count;
cpumask_var_t mask;
};

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/