Re: [PATCH RFC 01/10] mm: add Kernel Electric-Fence infrastructure

From: Jonathan Cameron
Date: Mon Sep 07 2020 - 11:48:04 EST


On Mon, 7 Sep 2020 15:40:46 +0200
Marco Elver <elver@xxxxxxxxxx> wrote:

> From: Alexander Potapenko <glider@xxxxxxxxxx>
>
> This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
> low-overhead sampling-based memory safety error detector of heap
> use-after-free, invalid-free, and out-of-bounds access errors.
>
> KFENCE is designed to be enabled in production kernels, and has near
> zero performance overhead. Compared to KASAN, KFENCE trades precision
> for performance. The main motivation behind KFENCE's design is that with
> enough total uptime KFENCE will detect bugs in code paths not typically
> exercised by non-production test workloads. One way to quickly achieve a
> large enough total uptime is to deploy the tool across a large fleet of
> machines.
>
> KFENCE objects each reside on a dedicated page, at either the left or
> right page boundary. The pages to the left and right of the object
> page are "guard pages", whose attributes are changed to a protected
> state so that any attempted access to them causes a page fault. Such
> page faults are then intercepted by KFENCE, which handles the fault
> gracefully by reporting a memory access error.
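
For anyone else reading along, the pool layout this describes is
roughly the following (my own sketch, not taken from the patch):

  ... -+-------+--------+-------+--------+-------+- ...
       | guard | object | guard | object | guard |
       | page  | page   | page  | page   | page  |
  ... -+-------+--------+-------+--------+-------+- ...

An access that strays off an object page lands on a protected guard
page and faults, which is what routes it to the reporting path.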
>
> Guarded allocations are set up based on a sample interval (which can be
> set via kfence.sample_interval). After expiration of the sample
> interval, the next allocation through the main allocator (SLAB or SLUB)
> is served from the KFENCE object pool. The timer is then reset, and the
> next guarded allocation is set up once the interval expires again.
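
As I read it, the sampling boils down to something like the sketch
below (the names here are mine, for illustration only -- the actual
mechanism is in this patch and may differ in detail):

        static atomic_t kfence_allocation_gate = ATOMIC_INIT(1);
        static void toggle_allocation_gate(struct work_struct *work);
        static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);

        static void toggle_allocation_gate(struct work_struct *work)
        {
                /*
                 * Open the gate: the next allocation through the main
                 * allocator is serviced from the KFENCE pool.
                 */
                atomic_set(&kfence_allocation_gate, 0);
                /* Re-arm for the next sample interval. */
                schedule_delayed_work(&kfence_timer,
                                      msecs_to_jiffies(kfence_sample_interval));
        }

with the allocation path claiming the gate via
atomic_xchg(&kfence_allocation_gate, 1), so that at most one guarded
allocation is handed out per interval.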
>
> To enable/disable a KFENCE allocation through the main allocator's
> fast-path without overhead, KFENCE relies on static branches via the
> static keys infrastructure. The static branch is toggled to redirect the
> allocation to KFENCE. To date, we have verified by running synthetic
> benchmarks (sysbench I/O workloads) that a kernel compiled with KFENCE
> is performance-neutral compared to the non-KFENCE baseline.
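
i.e. the fast path presumably reduces to something like this (again a
sketch; kfence_alloc() and the key name are my guesses at the
interface):

        DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);

        static __always_inline void *kfence_alloc(struct kmem_cache *s,
                                                  size_t size, gfp_t flags)
        {
                /* Patched out to a NOP while no sample is pending. */
                if (static_branch_unlikely(&kfence_allocation_key))
                        return __kfence_alloc(s, size, flags);
                return NULL;
        }

so SLAB/SLUB only pay for a single patched-out branch while KFENCE is
idle.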
>
> For more details, see Documentation/dev-tools/kfence.rst (added later in
> the series).
>
> Co-developed-by: Marco Elver <elver@xxxxxxxxxx>
> Signed-off-by: Marco Elver <elver@xxxxxxxxxx>
> Signed-off-by: Alexander Potapenko <glider@xxxxxxxxxx>

Interesting bit of work. A few trivial things spotted inline whilst
having a first read through.

Thanks,

Jonathan

> +
> +static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
> +{
> + /*
> + * Note: for allocations made before RNG initialization,
> + * prandom_u32_max() will always return zero. We still benefit from
> + * enabling KFENCE as early as possible, even when the RNG is not yet
> + * available, as this will allow KFENCE to detect bugs due to earlier
> + * allocations. The only downside is that the out-of-bounds accesses
> + * detected are deterministic for such allocations.
> + */
> + const bool right = prandom_u32_max(2);
> + unsigned long flags;
> + struct kfence_metadata *meta = NULL;
> + void *addr = NULL;

I think addr is set on all paths before it is used, so no need to
initialize it here.

> +
> + /* Try to obtain a free object. */
> + raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> + if (!list_empty(&kfence_freelist)) {
> + meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
> + list_del_init(&meta->list);
> + }
> + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> + if (!meta)
> + return NULL;
> +
> + if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) {
> + /*
> + * This is extremely unlikely -- we are reporting on a
> + * use-after-free, which locked meta->lock, and the reporting
> + * code via printk calls kmalloc() which ends up in
> + * kfence_alloc() and tries to grab the same object that we're
> + * reporting on. While it has never been observed, lockdep does
> + * report that there is a possibility of deadlock. Fix it by
> + * using trylock and bailing out gracefully.
> + */
> + raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> + /* Put the object back on the freelist. */
> + list_add_tail(&meta->list, &kfence_freelist);
> + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> +
> + return NULL;
> + }
> +
> + meta->addr = metadata_to_pageaddr(meta);
> + /* Unprotect if we're reusing this page. */
> + if (meta->state == KFENCE_OBJECT_FREED)
> + kfence_unprotect(meta->addr);
> +
> + /* Calculate address for this allocation. */
> + if (right)
> + meta->addr += PAGE_SIZE - size;
> + meta->addr = ALIGN_DOWN(meta->addr, cache->align);
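
Perhaps worth a comment spelling out the arithmetic: e.g. with
PAGE_SIZE 4096, size 100 and cache->align 8, a right-placed object
starts at page + 3996 and ALIGN_DOWN() moves it back to page + 3992,
so the object ends 4 bytes short of the guard page, with those bytes
covered by the canary. A left-placed object starts at the page
boundary, so an underflow hits the preceding guard page directly.
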
> +
> + /* Update remaining metadata. */
> + metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
> + /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
> + WRITE_ONCE(meta->cache, cache);
> + meta->size = right ? -size : size;
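
The sign of meta->size encoding left vs right placement is neat, but
deserves a one-line comment here -- it is what makes the abs() in
kfence_ksize() further down work.
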
> + for_each_canary(meta, set_canary_byte);
> + virt_to_page(meta->addr)->slab_cache = cache;
> +
> + raw_spin_unlock_irqrestore(&meta->lock, flags);
> +
> + /* Memory initialization. */
> +
> + /*
> + * We check slab_want_init_on_alloc() ourselves, rather than letting
> + * SL*B do the initialization, as otherwise we might overwrite KFENCE's
> + * redzone.
> + */
> + addr = (void *)meta->addr;
> + if (unlikely(slab_want_init_on_alloc(gfp, cache)))
> + memzero_explicit(addr, size);
> + if (cache->ctor)
> + cache->ctor(addr);
> +
> + if (CONFIG_KFENCE_FAULT_INJECTION && !prandom_u32_max(CONFIG_KFENCE_FAULT_INJECTION))
> + kfence_protect(meta->addr); /* Random "faults" by protecting the object. */
> +
> + atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
> + atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
> +
> + return addr;
> +}

...

> +
> +size_t kfence_ksize(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.
> + */
> + return meta ? abs(meta->size) : 0;
> +}
> +
> +void *kfence_object_start(const void *addr)
> +{
> + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + /*
> + * Read locklessly -- if there is a race with __kfence_alloc(), this
> + * most certainly is either a use-after-free, or invalid access.

To my reading, using "most certainly" makes this statement less clear.
I'd suggest:

Read locklessly -- if there is a race with __kfence_alloc() this
is either a use-after-free or invalid access.

Same for the other uses of that particular "most certainly".

> + */
> + return meta ? (void *)meta->addr : NULL;
> +}
> +
> +void __kfence_free(void *addr)
> +{
> + struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
> +
> + if (unlikely(meta->cache->flags & SLAB_TYPESAFE_BY_RCU))
> + call_rcu(&meta->rcu_head, rcu_guarded_free);
> + else
> + kfence_guarded_free(addr, meta);
> +}
> +
> +bool kfence_handle_page_fault(unsigned long addr)
> +{
> + const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
> + struct kfence_metadata *to_report = NULL;
> + enum kfence_error_type error_type;
> + unsigned long flags;
> +
> + if (!is_kfence_address((void *)addr))
> + return false;
> +
> + if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
> + return kfence_unprotect(addr); /* ... unprotect and proceed. */
> +
> + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
> +
> + if (page_index % 2) {
> + /* This is a redzone, report a buffer overflow. */
> + struct kfence_metadata *meta = NULL;

No need to set this to NULL here, as it is assigned three lines down.

> + int distance = 0;
> +
> + meta = addr_to_metadata(addr - PAGE_SIZE);
> + if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
> + to_report = meta;
> + /* Data race ok; distance calculation approximate. */
> + distance = addr - data_race(meta->addr + abs(meta->size));
> + }
> +
> + meta = addr_to_metadata(addr + PAGE_SIZE);
> + if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
> + /* Data race ok; distance calculation approximate. */
> + if (!to_report || distance > data_race(meta->addr) - addr)
> + to_report = meta;
> + }
> +
> + if (!to_report)
> + goto out;
> +
> + raw_spin_lock_irqsave(&to_report->lock, flags);
> + to_report->unprotected_page = addr;
> + error_type = KFENCE_ERROR_OOB;
> +
> + /*
> + * If the object was freed before we took the lock we can still
> + * report this as an OOB -- the report will simply show the
> + * stacktrace of the free as well.
> + */
> + } else {
> + to_report = addr_to_metadata(addr);
> + if (!to_report)
> + goto out;
> +
> + raw_spin_lock_irqsave(&to_report->lock, flags);
> + error_type = KFENCE_ERROR_UAF;
> + /*
> + * We may race with __kfence_alloc(), and it is possible that a
> + * freed object may be reallocated. We simply report this as a
> + * use-after-free, with the stack trace showing the place where
> + * the object was re-allocated.
> + */
> + }
> +
> +out:
> + if (to_report) {
> + kfence_report_error(addr, to_report, error_type);
> + raw_spin_unlock_irqrestore(&to_report->lock, flags);
> + } else {
> + /* This may be a UAF or OOB access, but we can't be sure. */
> + kfence_report_error(addr, NULL, KFENCE_ERROR_INVALID);
> + }
> +
> + return kfence_unprotect(addr); /* Unprotect and let access proceed. */
> +}
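
Took me a moment to see how this gets invoked, so for other readers:
presumably the arch fault path calls it before treating a kernel fault
as fatal, along these lines (a sketch only; the function name is made
up, and the real wiring is in the arch patches later in the series):

        /* In the arch's kernel page-fault handler, before oopsing: */
        static void do_kernel_fault(unsigned long addr)
        {
                /*
                 * Returns true if addr lies in the KFENCE pool: the
                 * error has been reported and the page unprotected, so
                 * simply return and let the faulting access retry.
                 */
                if (kfence_handle_page_fault(addr))
                        return;

                /* ... normal oops path ... */
        }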
...