Re: x86: bad pte in pageattr_test

From: Dmitry Vyukov
Date: Fri Jun 10 2016 - 06:18:49 EST


On Thu, Jun 9, 2016 at 11:34 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Tue, 7 Jun 2016, Dmitry Vyukov wrote:
>> >> I've got the following WARNING while running syzkaller fuzzer:
>> >>
>> >> CPA ffff880054118000: bad pte after revert 8000000054118363
>>
>> > CPA ffff880059990000: bad pte 8000000059990060
>
> In both cases the PTE bit which the test modifies is in the wrong state.
>
>> Should we delete this test if it is not important?
>
> No. There is something badly wrong.
>
> PAGE_BIT_CPA_TEST is the same as PAGE_BIT_SPECIAL. And the latter is used by
> the mm code to mark user space mappings. The test code only modifies the
> direct mapping, i.e. the kernel side one.
>
> So something sets PAGE_BIT_SPECIAL on a kernel PTE. And that's definitely a
> bug.
>
> These are the last entries from your syzkaller log file of the first incident:
>
> r0 = perf_event_open(&(0x7f000000f000-0x78)={0x2, 0x78, 0x11, 0x7, 0xd537, 0x6, 0x0, 0xc1, 0xffff, 0x5, 0x0, 0x40, 0x4, 0x9, 0x5369, 0x8, 0x7, 0x8508, 0x3, 0x80, 0x0}, 0x0, 0xffffffff, 0xffffffffffffffff, 0x0)
> mmap(&(0x7f0000cbb000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> r1 = syz_open_dev$mouse(&(0x7f0000cbb000)="2f6465762f696e7075742f6d6f7573652300", 0x100, 0xa00)
> mmap(&(0x7f0000cbc000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> setsockopt$BT_SNDMTU(r1, 0x112, 0xc, &(0x7f0000cbc000)=0x5, 0x2)
> mmap(&(0x7f0000cbb000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> ioctl$EVIOCGEFFECTS(r1, 0x80044584, &(0x7f0000cbc000-0x942)=nil)
> r2 = fcntl$dupfd(r0, 0x406, r0)
> mmap(&(0x7f0000cbc000)=nil, (0x1000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> mmap(&(0x7f00002bf000)=nil, (0x1000), 0x3, 0x8010, 0xffffffffffffffff, 0x0)
> mmap(&(0x7f0000000000)=nil, (0x0), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> pwritev(r2, &(0x7f00007e9000)=[{&(0x7f0000cbc000)=....
>
> Do you have log of the second one available as well?
>
> CC'ing mm and perf folks.


Here is the second log:
https://gist.githubusercontent.com/dvyukov/dd7970a5daaa7a30f6d37fa5592b56de/raw/f29182024538e604c95d989f7b398816c3c595dc/gistfile1.txt

I've hit it only twice. The first time I tried hard to reproduce it, with
no success. So unfortunately that's all we have.

Re logs: my setup executes up to 16 programs in parallel, so for a
normal BUG any of the preceding 16 programs can be guilty. But since
this check is asynchronous, the culprit can be any preceding program
in the log, not just one of the last 16.

I would expect that it is triggered by some rarely executed, poorly
tested code. Maybe mmap of some device?
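
For reference, the two PTE values from the warnings can be decoded with a
short script. This is only a sketch: it assumes the x86-64 4K PTE layout and
that bit 9 (the first software bit) is the one shared by PAGE_BIT_CPA_TEST
and PAGE_BIT_SPECIAL. The flag names and the decode_pte helper are
illustrative, not taken from the kernel.

```python
# Assumed x86-64 4K PTE flag bits; the names are illustrative labels.
PTE_FLAGS = {
    0: "PRESENT",
    1: "RW",
    2: "USER",
    3: "PWT",
    4: "PCD",
    5: "ACCESSED",
    6: "DIRTY",
    7: "PAT",
    8: "GLOBAL",
    9: "SOFTW1",  # assumed to be the bit behind PAGE_BIT_CPA_TEST / PAGE_BIT_SPECIAL
    63: "NX",
}

# Physical frame address lives in bits 12..51 of the entry.
PHYS_MASK = 0x000FFFFFFFFFF000


def decode_pte(pte):
    """Split a raw PTE value into its physical address and the set flag names."""
    flags = [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]
    return {"phys": pte & PHYS_MASK, "flags": flags}


if __name__ == "__main__":
    # The two values reported in the CPA warnings in this thread.
    for label, pte in (
        ("bad pte after revert", 0x8000000054118363),
        ("bad pte", 0x8000000059990060),
    ):
        decoded = decode_pte(pte)
        print(f"{label}: phys={decoded['phys']:#x} flags={','.join(decoded['flags'])}")
```

Under these assumptions the first value still has bit 9 set after the revert,
and the second has it clear, which is consistent with Thomas's reading that
the tested bit is in the wrong state in both cases.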