Re: A peculiarity in ptrace/waitpid behavior

From: Pavel Labath
Date: Fri Mar 20 2015 - 14:53:53 EST


Sending again, this time as plain text (I hope)...

On 20 March 2015 at 18:46, Pavel Labath <labath@xxxxxxxxxx> wrote:
>
> Hi,
>
> thanks for the super quick response. :)
>
> I am at home now, so I don't have access to the same machine to run the test. I will run it on Monday and let you know.
>
> Meanwhile, I have tried running your test on my home machine, and it is indeed reporting "unexpected wait: stat=57f". If I understand correctly, that means the wait has reported SIGTRAP (0x57f is a stopped status with stop signal 5) even though the tracee was in ptrace-stop.
>
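For reference, the "stat=57f" value can be decoded with the standard wait macros. This is only a minimal sketch, assuming Linux and glibc; stop_signal() is a hypothetical helper name of mine, not something taken from the test-case:

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: return the stop signal encoded in a waitpid()
 * status word, or -1 if the status does not describe a stopped child.
 */
static int stop_signal(int status)
{
    if (!WIFSTOPPED(status))
        return -1;              /* exited, killed, or continued */
    return WSTOPSIG(status);    /* bits 8..15 of the status word */
}
```

With status = 0x57f the low byte is 0x7f, so WIFSTOPPED() is true, and WSTOPSIG() yields 5, which is SIGTRAP on Linux.
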
> I can imagine that something similar is happening in our case. Since the PTRACE_CONT and waitpid calls are happening in different threads, I can't positively say which one happened first. So far I have assumed the sequence was PTRACE_CONT -> waitpid -> PTRACE_GETSIGINFO. However, if wait can return even though the process is stopped, then a possible sequence of events is waitpid -> PTRACE_CONT -> PTRACE_GETSIGINFO, in which case it is not surprising that the last call fails, since the tracee is already running again by then. One difference I see, though, is that in our test we are not sending any additional signals to the thread in question (at least we shouldn't be sending them, but we are sending some signals to other threads in the same process). Do you think it could still be the same issue?
>
> I would be happy to test your patch. I don't think I can patch the kernel on my work machine directly, but I think I might be able to set up some sort of a test environment to try it out.
>
> regards,
> pavel
>
>
> On 20 March 2015 at 16:25, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>>
>> Hi Pavel,
>>
>> let me add lkml, we should not discuss this offlist.
>>
>> On 03/20, Pavel Labath wrote:
>> >
>> > 1) we get a waitpid() notification that the tracee got SIGUSR1
>> > 2) we do a ptrace(GETSIGINFO) to get more info
>> > 3) eventually we decide to restart the tracee with PTRACE_CONT, passing it
>> > SIGUSR1
>> > 4) immediately after that we get another waitpid notification, again with
>> > SIGUSR1, even though the thread had received no additional signals
>> > 5) we again try to do a GETSIGINFO, however this time it fails with ESRCH.
>> > Therefore, we assume that the thread has died
>>
>> I found a similar bug by code inspection some time ago. I even have
>> a fix, but I need to think more... And I even wrote the test-case ;)
>> see below.
>>
>> But so far I can't say if you hit the same problem or not. If you can
>> reproduce the problem, perhaps I can send you a debugging patch?
>>
>> Oleg.
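
To make the suspected ordering concrete, here is a rough sketch of steps 1-3 from the list quoted above, followed by a GETSIGINFO issued after the PTRACE_CONT. It does not reproduce the kernel bug under discussion; it only illustrates that PTRACE_GETSIGINFO on a tracee that is no longer in ptrace-stop fails with ESRCH, which is what the waitpid -> PTRACE_CONT -> GETSIGINFO ordering would produce. It assumes Linux with glibc and a single-threaded tracee (our real case is multi-threaded), and getsiginfo_errno_after_cont() is a made-up name for this example:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not from the thread): returns the errno of a
 * PTRACE_GETSIGINFO issued after PTRACE_CONT, 0 if that call unexpectedly
 * succeeds, or -1 if the setup itself fails.
 */
static int getsiginfo_errno_after_cont(void)
{
    int status;
    int err;
    siginfo_t si;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced, then raise SIGUSR1 to enter a stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(2);
        raise(SIGUSR1);
        _exit(0);
    }

    /* Step 1: waitpid() reports that the tracee stopped with SIGUSR1. */
    if (waitpid(pid, &status, 0) != pid ||
        !WIFSTOPPED(status) || WSTOPSIG(status) != SIGUSR1)
        goto fail;

    /* Step 2: while the tracee is in ptrace-stop, GETSIGINFO works. */
    if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == -1 ||
        si.si_signo != SIGUSR1)
        goto fail;

    /* Step 3: restart the tracee, passing it SIGUSR1. */
    if (ptrace(PTRACE_CONT, pid, NULL, (void *)(long)SIGUSR1) == -1)
        goto fail;

    /* The tracee has left ptrace-stop, so this request should fail. */
    errno = 0;
    err = (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == -1) ? errno : 0;

    /* Reap the tracee (SIGUSR1's default action terminates it). */
    waitpid(pid, &status, 0);
    return err;

fail:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

Calling getsiginfo_errno_after_cont() from a small main() and printing the result with strerror() should be enough to try it. Whether the tracee is still running or has already been killed by the injected SIGUSR1, it is not in ptrace-stop at that point, so I would expect ESRCH in both cases.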