Re: [PATCH 10/10] Dynamic fault injection

From: Kent Overstreet
Date: Fri May 18 2018 - 14:14:45 EST


On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
>
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself. The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
>
>
> That said, I think this is a useful functionality. We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated. This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.

This is pretty much what I was looking for. I just wanted to know if this patch
was interesting enough to anyone that I should spend more time on it, or just
drop it :) Agreed on documentation. I think it's also worth factoring out the
ELF section trick that dynamic debug uses, so both can share it.
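
For anyone not familiar with the ELF section trick, here is a rough userspace
sketch of the idea, not the code from the patch. Every call site drops a static
descriptor into a named section, and the linker-provided __start_/__stop_
symbols let you walk all of them as one array. It assumes GNU C extensions
(statement expressions, section attributes) and a linker that emits
__start_<section>/__stop_<section> symbols (GNU ld, lld). The names fault_site,
fault_point(), fault_set() and do_read() are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* One descriptor per call site; hypothetical layout for illustration. */
struct fault_site {
	const char *name;
	const char *file;
	int line;
	bool enabled;
};

/* Descriptors are walked as an array, so the stride must match sizeof(). */
static_assert(sizeof(struct fault_site) % 8 == 0, "section stride mismatch");

/* Emitted by the linker for any section whose name is a valid C identifier. */
extern struct fault_site __start_fault_sites[];
extern struct fault_site __stop_fault_sites[];

/*
 * Place a static descriptor in the "fault_sites" section and evaluate to
 * true when that site is enabled. The explicit alignment stops the compiler
 * from padding between descriptors; the volatile read stops it from folding
 * the never-written-here flag to a constant.
 */
#define fault_point(label) ({						\
	static struct fault_site __site					\
		__attribute__((section("fault_sites"), used, aligned(8))) = { \
		.name = (label), .file = __FILE__, .line = __LINE__,	\
	};								\
	__builtin_expect(*(volatile bool *)&__site.enabled, 0);		\
})

/* Enable or disable every site with a matching name; returns the match count. */
int fault_set(const char *name, bool on)
{
	int matched = 0;

	for (struct fault_site *s = __start_fault_sites;
	     s < __stop_fault_sites; s++) {
		if (strcmp(s->name, name) == 0) {
			s->enabled = on;
			matched++;
		}
	}
	return matched;
}

/* Example call site: fails with -EIO only while "read_io" is enabled. */
int do_read(void)
{
	if (fault_point("read_io"))
		return -EIO;
	return 0;
}
```

Calling fault_set("read_io", true) flips the one descriptor emitted by
do_read(), after which do_read() returns -EIO until the site is disabled
again. In the kernel the same walk would presumably be driven from a control
file rather than a direct function call.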

> Some things that are missing from this patch that are in our code:
>
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>   - timeout: sleep for N msec to simulate network/disk/locking delays
>   - race: wait with one thread until a second thread hits matching check
>
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).

Those all sound like good ideas... fail_val especially. I think with that we'd
have all the functionality the existing fault injection framework has (which is
way too heavyweight to actually get used, imo).
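
To make the fail_val idea concrete, here is a minimal sketch of how a check
could combine "enabled", "oneshot" and a value match. It is an assumption about
how this might look, not Lustre's OBD_FAIL_CHECK() and not the code in this
patch. struct fault, fault_check() and the field names are hypothetical, and
the sketch ignores locking and the timeout/race modes entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fault_mode {
	FAULT_OFF,	/* never fires */
	FAULT_ENABLED,	/* fires on every matching hit */
	FAULT_ONESHOT,	/* fires on the first matching hit, then turns off */
};

struct fault {
	enum fault_mode mode;
	bool match_val;		/* if set, only fire when val == fail_val */
	long fail_val;		/* e.g. a server index or an RPC opcode */
	unsigned long hits;	/* number of times the fault actually fired */
};

/*
 * Returns true when the caller should inject a failure. Not thread-safe:
 * a real implementation would need atomics or a lock around the oneshot
 * transition and the hit counter.
 */
bool fault_check(struct fault *f, long val)
{
	assert(f != NULL);

	/* Common case: the fault is off, so keep this path cheap. */
	if (__builtin_expect(f->mode == FAULT_OFF, 1))
		return false;

	/* Conditional faults only fire for the configured value. */
	if (f->match_val && val != f->fail_val)
		return false;

	/* A oneshot fault disarms itself after its first matching hit. */
	if (f->mode == FAULT_ONESHOT)
		f->mode = FAULT_OFF;

	f->hits++;
	return true;
}
```

With mode = FAULT_ONESHOT, match_val = true and fail_val = 3, a call with
val 1 does nothing, the first call with val 3 fires, and every later call is a
no-op because the fault has disarmed itself.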