2.6.27.8 scheduler bug - threads not being scheduled for long periods

From: Dimitri Sivanich
Date: Mon Jan 05 2009 - 12:57:41 EST

Next message: Eduard - Gabriel Munteanu: "Re: [PATCH 3/3] kmemtrace: Use tracepoints instead of markers."
Previous message: Jaswinder Singh Rajput: "Re: [PATCH] firmware: Arrange WHENCE in alphabetical order to avoid conflicts"
Next in thread: Gregory Haskins: "Re: 2.6.27.8 scheduler bug - threads not being scheduled for longperiods"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This concerns a scheduler bug that we've been looking at in 2.6.27.8 on
ia64 Altix machines.

Below is the current testcase that we're using to reproduce this problem.

The test should be run as follows:
taskset -c <cpu> t -c <another cpu> -n <non-zero nice level>

It creates 5 pthreads. Each of those threads changes their thread affinity
to the cpu passed in, sets their nice value to the passed in value, and
counts in a loop.

Depending on what cpus are chosen for the pair of cpus passed into taskset
and the test itself, output will look something like the following:

# taskset -c 3 t -c 2 -n -5
Create thread 0
Create thread 1
slave 0: pid 4086 cpu 2 nlevel -5
Create thread 2
slave 1: pid 4087 cpu 2 nlevel -5
Create thread 3
slave 2: pid 4088 cpu 2 nlevel -5
Create thread 4
slave 3: pid 4089 cpu 2 nlevel -5
slave 4: pid 4090 cpu 2 nlevel -5
Wait ready
GO
Counts:
0: 47996540
1: 47548033
2: 3419
3: 3249
4: 67665404

Counts:
0: 96877777
1: 95094916
2: 3419
3: 3249
4: 116545241

Note that threads 2 and 3 have stopped running. They will eventually run
at some time in the future, but this length of time can be extremely long
depending on the pair of cpus chosen. The test usually fails with the same
pattern if the same pair of cpus is always chosen. It will run differently
for different cpu pairs, and will run fine for certain pairs of cpus.

After some examination we've found the overall reason for this. The
runqueue rb tree is sorted by the vruntime value associated with each
thread. These vruntime values can occasionally get set to values far above
other thread's vruntime values, leaving those threads to starve waiting to
be moved up in the runqueue.

One place we've found this happens is in update_curr(), which calculates a
delta_exec value as follows:
delta_exec = (unsigned long)(now - curr->exec_start);

Sometimes this value will be very large, as 'now' (the rq clock time) will
be less than 'exec_start'. When this happens, __update_curr() will
calculate a delta_exec_weighted based on this large value and add it to the
thread's vruntime:
curr->vruntime += delta_exec_weighted;

For this particular case, we've found that if we add a hack to clamp the
calculation of delta_exec to always be positive, this will allow the above
testcase to run.

+ if (now < curr->exec_start)
+ curr->exec_start = now - 1;

delta_exec = (unsigned long)(now - curr->exec_start);

I'm wondering if anyone else has run into this problem, and whether they've
found a resolution for it?

/*
* gcc -o jtest jtest.c -lpthread
* jtest -c8 -v
*/
#define _GNU_SOURCE 1
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define NTHR 5
#define perrorx(s) do {perror(s); exit(1);} while (0)

extern int optind, opterr;
extern char *optarg;

static void *slave(void *arg);

volatile int ready, go = 0;

struct slave_s {
int cpu;
int id;
int nlevel;
} slave_t[NTHR];

unsigned long x[NTHR];

static int runonx(int cpu)
{
cpu_set_t mask;

if (cpu < 0 || cpu >= 1024)
return -1;
memset(&mask, 0, sizeof (mask));
CPU_SET(cpu, &mask);
if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
perrorx("setaffinity failed");
return 0;
}

int main(int argc, char **argv)
{
int c, th, k = 0, er = 0;
int cpu = 1;
int nlevel = 0;
unsigned loops = (unsigned)-1;
unsigned long i = 0;
static char optstr[] = "c:l:n:v";
pthread_t *threads;

opterr = 1;
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'c':
cpu = atoi(optarg);
break;
case 'l':
loops = atoi(optarg);
break;
case 'n':
nlevel = atoi(optarg);
break;
case '?':
er = 1;
break;
}

if (er)
exit(1);

sleep(1);
threads = malloc(sizeof(pthread_t) * NTHR);
for (th = 0; th < NTHR; th++) {
printf("Create thread %d\n", th);
slave_t[th].cpu = cpu;
slave_t[th].id = th;
slave_t[th].nlevel = nlevel;
if (pthread_create(&threads[th], NULL, slave, (void *)&slave_t[th]) != 0)
perrorx("pthread_create failed:");
}
runonx(cpu);
if (1 || nlevel) {
errno = 0;
if (nice(nlevel) == -1 && errno) {
perror("nice");
exit(-1);
}
}

printf("Wait ready\n");
while (ready < NTHR)
usleep(10000);
printf("GO\n");
while(k<loops) {
i++;
if (i % 10000000 == 0) {
int j;
printf("Counts:\n");
for (j=0; j<NTHR; j++)
printf("%d: %lu\n",j,x[j]);
printf("\n");
k++;
}
}
go++;
for (th = 0; th < NTHR; th++)
pthread_join(threads[th], NULL);
printf("Master Done\n");
return 0;
}

static void *slave(void *arg)
{
struct slave_s * s = (struct slave_s *)arg;

runonx(s->cpu);
if (1 || s->nlevel) {
errno = 0;
if (nice(s->nlevel) == -1 && errno) {
perror("nice");
exit(-1);
}
}
printf("slave %d: pid %ld cpu %d nlevel %d\n", s->id, syscall(SYS_gettid), s->cpu, s->nlevel);
(void)__sync_fetch_and_add(&ready, 1);

while (!go) {
x[s->id]++;
}

printf("slave %d done\n", s->id);

pthread_exit(0);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Eduard - Gabriel Munteanu: "Re: [PATCH 3/3] kmemtrace: Use tracepoints instead of markers."
Previous message: Jaswinder Singh Rajput: "Re: [PATCH] firmware: Arrange WHENCE in alphabetical order to avoid conflicts"
Next in thread: Gregory Haskins: "Re: 2.6.27.8 scheduler bug - threads not being scheduled for longperiods"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]