Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

From: NeilBrown
Date: Tue May 09 2017 - 18:03:47 EST


On Tue, May 09 2017, Jeff Layton wrote:

> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
>
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
>
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
>
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
>
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
>
> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>

I like that this is a separate lib/*.c - nicely structured too.

Reviewed-by: NeilBrown <neilb@xxxxxxxx>

Thanks,
NeilBrown


> ---
> include/linux/errseq.h | 19 +++++
> lib/Makefile | 2 +-
> lib/errseq.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 219 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/errseq.h
> create mode 100644 lib/errseq.c
>
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> new file mode 100644
> index 000000000000..0d2555f310cd
> --- /dev/null
> +++ b/include/linux/errseq.h
> @@ -0,0 +1,19 @@
> +#ifndef _LINUX_ERRSEQ_H
> +#define _LINUX_ERRSEQ_H
> +
> +/* See lib/errseq.c for more info */
> +
> +typedef u32 errseq_t;
> +
> +void __errseq_set(errseq_t *eseq, int err);
> +static inline void errseq_set(errseq_t *eseq, int err)
> +{
> + /* Optimize for the common case of no error */
> + if (unlikely(err))
> + __errseq_set(eseq, err);
> +}
> +
> +errseq_t errseq_sample(errseq_t *eseq);
> +int errseq_check(errseq_t *eseq, errseq_t since);
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..2423afef40f7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
> gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
> bsearch.o find_bit.o llist.o memweight.o kfifo.o \
> percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
> - once.o refcount.o
> + once.o refcount.o errseq.o
> obj-y += string_helpers.o
> obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
> obj-y += hexdump.o
> diff --git a/lib/errseq.c b/lib/errseq.c
> new file mode 100644
> index 000000000000..0f8b4ed460f0
> --- /dev/null
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include <linux/err.h>
> +#include <linux/bug.h>
> +#include <linux/atomic.h>
> +#include <linux/errseq.h>
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.
> + *
> + * It's implemented as an unsigned 32-bit value. The low order bits are
> + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
> + * are used as a counter. This is done with atomics instead of locking so that
> + * these functions can be called from any context.
> + *
> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether any
> + * new errors have occurred since that time.
> + *
> + * Note that there is a risk of collisions, if new errors are being recorded
> + * frequently, since we have so few bits to use as a counter.
> + *
> + * To mitigate this, one bit is used as a flag to tell whether the value has
> + * been sampled since a new value was recorded. That allows us to avoid bumping
> + * the counter if no one has sampled it since the last time an error was
> + * recorded.
> + *
> + * A new errseq_t should always be zeroed out. A errseq_t value of all zeroes
> + * is the special (but common) case where there has never been an error. An all
> + * zero value thus serves as the "epoch" if one wishes to know whether there
> + * has ever been an error set since it was first initialized.
> + */
> +
> +/* The low bits are designated for error code (max of MAX_ERRNO) */
> +#define ERRSEQ_SHIFT ilog2(MAX_ERRNO + 1)
> +
> +/* This bit is used as a flag to indicate whether the value has been seen */
> +#define ERRSEQ_SEEN (1 << ERRSEQ_SHIFT)
> +
> +/* The "ones" bit for the counter */
> +#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
> +
> +/**
> + * __errseq_set - set a errseq_t for later reporting
> + * @eseq: errseq_t field that should be set
> + * @err: error to set
> + *
> + * This function sets the error in *eseq, and increments the sequence counter
> + * if the last sequence was sampled at some point in the past.
> + *
> + * Any error set will always overwrite an existing error.
> + *
> + * Most callers will want to use the errseq_set inline wrapper to efficiently
> + * handle the common case where err is 0.
> + */
> +void __errseq_set(errseq_t *eseq, int err)
> +{
> + errseq_t old;
> +
> + /* MAX_ERRNO must be able to serve as a mask */
> + BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
> +
> + /*
> + * Ensure the error code actually fits where we want it to go. If it
> + * doesn't then just throw a warning and don't record anything. We
> + * also don't accept zero here as that would effectively clear a
> + * previous error.
> + */
> + if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
> + "err = %d\n", err))
> + return;
> +
> + old = READ_ONCE(*eseq);
> + for (;;) {
> + errseq_t new, cur;
> +
> + /* Clear out error bits and set new error */
> + new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
> +
> + /* Only increment if someone has looked at it */
> + if (old & ERRSEQ_SEEN)
> + new += ERRSEQ_CTR_INC;
> +
> + /* If there would be no change, then call it done */
> + if (new == old)
> + break;
> +
> + /* Try to swap the new value into place */
> + cur = cmpxchg(eseq, old, new);
> +
> + /*
> + * Call it success if we did the swap or someone else beat us
> + * to it for the same value.
> + */
> + if (likely(cur == old || cur == new))
> + break;
> +
> + /* Raced with an update, try again */
> + old = cur;
> + }
> +}
> +EXPORT_SYMBOL(__errseq_set);
> +
> +/**
> + * errseq_sample - grab current errseq_t value
> + * @eseq: pointer to errseq_t to be sampled
> + *
> + * This function allows callers to sample an errseq_t value, marking it as
> + * "seen" if required.
> + */
> +errseq_t errseq_sample(errseq_t *eseq)
> +{
> + errseq_t old = READ_ONCE(*eseq);
> + errseq_t new = old;
> +
> + /*
> + * For the common case of no errors ever having been set, we can skip
> + * marking the SEEN bit. Once an error has been set, the value will
> + * never go back to zero.
> + */
> + if (old != 0) {
> + new |= ERRSEQ_SEEN;
> + if (old != new)
> + cmpxchg(eseq, old, new);
> + }
> + return new;
> +}
> +EXPORT_SYMBOL(errseq_sample);
> +
> +/**
> + * errseq_check - has an error occurred since a particular point in time?
> + * @eseq: pointer to errseq_t value to be checked
> + * @since: previously-sampled errseq_t from which to check
> + *
> + * Grab the value that eseq points to, and see if it has changed "since"
> + * the given value was sampled. The "since" value is not advanced, so there
> + * is no need to mark the value as seen.
> + *
> + * Returns the latest error set in the errseq_t or 0 if it hasn't changed.
> + */
> +int errseq_check(errseq_t *eseq, errseq_t since)
> +{
> + errseq_t cur = READ_ONCE(*eseq);
> +
> + if (likely(cur == since))
> + return 0;
> + return -(cur & MAX_ERRNO);
> +}
> +EXPORT_SYMBOL(errseq_check);
> +
> +/**
> + * errseq_check_and_advance - check an errseq_t and advance it to the current value
> + * @eseq: pointer to value being checked reported
> + * @since: pointer to previously-sampled errseq_t to check against and advance
> + *
> + * Grab the eseq value, and see whether it matches the value that "since"
> + * points to. If it does, then just return 0.
> + *
> + * If it doesn't, then the value has changed. Set the "seen" flag, and try to
> + * swap it into place as the new eseq value. Then, set that value as the new
> + * "since" value, and return whatever the error portion is set to.
> + *
> + * Note that no locking is provided here for concurrent updates to the "since"
> + * value. The caller must provide that if necessary. Because of this, callers
> + * may want to do a lockless errseq_check before taking the lock and calling
> + * this.
> + */
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> + int err = 0;
> + errseq_t old, new;
> +
> + /*
> + * Most callers will want to use the inline wrapper to check this,
> + * so that the common case of no error is handled without needing
> + * to lock.
> + */
> + old = READ_ONCE(*eseq);
> + if (old != *since) {
> + /*
> + * Set the flag and try to swap it into place if it has
> + * changed.
> + *
> + * We don't care about the outcome of the swap here. If the
> + * swap doesn't occur, then it has either been updated by a
> + * writer who is bumping the seq count anyway, or another
> + * reader who is just setting the "seen" flag. Either outcome
> + * is OK, and we can advance "since" and return an error based
> + * on what we have.
> + */
> + new = old | ERRSEQ_SEEN;
> + if (new != old)
> + cmpxchg(eseq, old, new);
> + *since = new;
> + err = -(new & MAX_ERRNO);
> + }
> + return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);
> --
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Attachment: signature.asc
Description: PGP signature