Re: [PATCH 10/10] Dynamic fault injection

From: Andreas Dilger
Date: Fri May 18 2018 - 15:58:22 EST


On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>>>
>>> Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
>>
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself. The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>>
>>
>> That said, I think this is a useful functionality. We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated. This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
>
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
>
>> Some things that are missing from this patch that is in our code:
>>
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>> - timeout: sleep for N msec to simulate network/disk/locking delays
>> - race: wait with one thread until a second thread hits matching check
>>
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
>
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing that we have that is slightly orthogonal to your modes,
which is possible because we just have a __u32 for the fault location,
is that the "oneshot" mode is just a mask added to the fault location
together with "fail_val" is that we can add other masks "fail N times",
"fail randomly 1/N times", or "pass N times before failure". The other
mask is set in the kernel when the fault was actually hit, so that test
scripts can poll until that happens, and then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas





Attachment: signature.asc
Description: Message signed with OpenPGP