çå: [PATCH RESEND] arm64: fault: avoid send SIGBUS two times

From: gengdongjiu
Date: Mon Dec 11 2017 - 10:39:45 EST


Hi James,
Thanks for your review and suggestion.

> Hi gengdongjiu,
>
> On 08/12/17 04:43, gengdongjiu wrote:
> > by the way, I think also change the info.si_code to "BUS_MCEERR_AR" is better, as shown [1].
> > BUS_MCEERR_AR can tell user space "Hardware memory error consumed on a error; action required".
>
> Today its also used as the last-resort. This signal tells user-space the page can't be re-read from disk/swap, and its been unmapped from all
> affected processes.
>
> I think using it like this (tempting as it is) changes the meaning.


Consider again, I think what is your said is reasonable, when the ghes_notify_sea() return failure, it means the meory_failure() does not handler the error and even not
unmapped the affected processes.
So setting the si_code to BUS_MCEERR_AR may not better, I will set the si_code to 0.
Thanks for your suggestion and reminder.

>
>
> > so it is better than "0". In the X86 platform, it also use the "BUS_MCEERR_AR" for si_code[2] in "arch/x86/mm/fault.c".
> > what do you think about it?
>
> This is heading into kernel-first territory, I'd prefer we do that all at once so we know everything is covered.

Yes, it is.

>
>
> > [2]:
> > arch/x86/mm/fault.c:
> >
> > static void
> > do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> > u32 *pkey, unsigned int fault)
> > {
> > ......
> > #ifdef CONFIG_MEMORY_FAILURE
> > if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
>
> These VM_FAULT flags indicate memory_failure() has run, tried to re-read the memory from disk/swap, failed, and unmapped the page
> from all affected processes.

Understand, These VM_FAULT flags is different with the do_sea() handling. In the do_sea(), the memory_failure() may not run.

>
>
> > printk(KERN_ERR
> > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > tsk->comm, tsk->pid, address);
> > code = BUS_MCEERR_AR;
> > }
> > #endif
> > force_sig_info_fault(SIGBUS, code, address, tsk, pkey, fault); }
>
> This is x86's page fault handler, not its Machine-Check-Exception handler.
>
> arm64's page fault handler does this too, from do_page_fault():

Yes, indeed.
just now I check the code, you are right.

> > } else if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) {
> > sig = SIGBUS;
> > code = BUS_MCEERR_AR;
>
>
> If you're seeing this, its likely due to the race Xie XiuQi spotted where the recovery action has been queued, then we return to user-space
> before its done.
>
> I had a go at tackling this, adding helpers to kick the assorted queues, which we can do if we took the exception from user-space. Where I
> got stuck is whether we should still force a signal, and how signals get merged. I'll try and spend some more time on that this week.


Understand, thanks for tacking and effort.

>
>
>
> Thanks,
>
> James