[BUG] perf: perf sched warning possibly due to clock granularity onAMD

From: Stephane Eranian
Date: Mon Feb 06 2012 - 08:25:57 EST


Hi,

I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode.
I was testing perf sched and I ran into an issue. That issue seems
to exist only on AMD and not on Intel systems. It is not PMU related
because I am doing tracing.

I am running a simple ping pong test to stress the context switch
code. Two processes exchanging a byte through a pipe (program provided
below).

Then, I capture a trace using perf sched, and I run sched lat, but
on certain runs, I get:

$ perf sched rec pong 2
$ perf sched lat
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: TimesFound 4934 unknown events!
Is this an older tool processing a perf.data file generated by a more recent tool?

I did some investigation in the kernel to figure out what
could cause this problem. I found out that with this measurement, it is
possible to get multiple consecutive samples with the SAME timestamp.
That is possible on AMD because, apparently, local_clock() ends up calling,
a sched_clock() function that is only based on jiffies. That means, the
granularity is a tick (1ms in my case). On Intel systems, at least on mines,
local_clock() ends up calling a routine which uses TSC and thus has a much
greater precision. This is all decided by the sched_clock_stable variable. It
is set at init time for Intel and not used on AMD. Yet, recent AMD systems do
have support for NONSTOP_TSC. It is not clear to me why that would not be enough
(unless it is a cross-socket synchronization issue).

One thing is clear, is that if I enable sched_clock_stable on my AMD system, then
the perf sched warning disappears.

I believe there is really a bug in perf. I am sure there are systems out there
which cannot provide a very fine grain timestamp, thus it is possible to get samples
with identical timestamps. I don't quite understand the sample re-ordering code in
perf but it seems sensitive to timer granularity. I think it should not.

In summary, two issues:
- Why is sched_clock_stable not set or even tested on recent AMD systems?
- perf should not rely on fine granularity timestamps in the re-ordering code

Any comments?

#include <sys/types.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <err.h>

#define MAX_CPUS 64
#define NR_CPU_BITS (MAX_CPUS>>3)
int
pin_cpu(pid_t pid, unsigned int cpu)
{
uint64_t my_mask[NR_CPU_BITS];

if (cpu >= MAX_CPUS)
errx(1, "this program supports only up to %d CPUs", MAX_CPUS);

my_mask[cpu>>6] = 1ULL << (cpu&63);

return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask);
}

static volatile int quit;
void sig_handler(int n)
{
quit = 1;
}

static void
do_child(int fr, int fw)
{
char c;
ssize_t ret;

for(;;) {
ret = read(fr, &c, 1);
if (ret < 0)
break;
ret = write(fw, "c", 1);
if (ret < 0)
break;

}
printf("child exited\n");
exit(0);
}


int
main(int argc, char **argv)
{
int ret;
int pr[2], pw[2];
int which_cpu;
pid_t pid;
uint64_t nctx = 0;
ssize_t nbytes;
char c = '0';
int delay;

delay = argc > 1 ? atoi(argv[1]) : 10;

srandom(getpid());
which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN);

ret = pipe(pr);
if (ret)
err(1, "cannot create read pipe");

ret = pipe(pw);
if (ret)
err(1, "cannot create write pipe");

ret = pin_cpu(getpid(), which_cpu);
if (ret)
err(1, "cannot pin to CPU%d", which_cpu);

printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay);

switch(pid=fork()) {
case -1:
err(1, "cannot create child");
case 0:
close(pr[1]); close(pw[0]);
do_child(pr[0], pw[1]);
exit(1);
}

close(pr[0]);
close(pw[1]);

signal(SIGALRM, sig_handler);
alarm(delay);

while(!quit) {
nbytes = write(pr[1], "c", 1);
nbytes = read(pw[0], &c, 1);
nctx++;
}
close(pr[1]);
close(pw[0]);
return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/