Re: Updated scalable urandom patchkit

From: George Spelvin
Date: Mon Oct 12 2015 - 03:49:16 EST


>> I'm not thrilled with incrementing the pointer from i to i+len, but mixing
>> at positions i+k to i+k+len. The whole LFSR scheme relies on a regular
>> pass structure.

> That part I'm not worried about. We still have a regular pass
> structure --- since for each CPU, we are still iterating over the pool
> in a regular fashion.

Not if I understand what you are doing.

Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
(I'll also use a count-up index even though I know it's actually
count-down in the code.)

CPU 0 writes 5 words at 0..4, leaving add_ptr = 5.
Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
Then CPU 0 writes its next word at 10.

Words 5..9 never got mixed at all.
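
Here's a throwaway userspace model of that interleaving, in case it
helps (pool size and offsets are made up; the point is only the gap):

	/* Toy model: two writers share one add_ptr but mix at per-CPU
	 * offsets, as described above.  Count-up index for clarity. */
	#include <stdio.h>

	#define POOL_WORDS 32

	int main(void)
	{
		int touched[POOL_WORDS] = { 0 };
		int add_ptr = 0, i;

		for (i = 0; i < 5; i++)		/* CPU 0, offset 0: words 0..4 */
			touched[(add_ptr++ + 0) % POOL_WORDS] = 1;
		for (i = 0; i < 5; i++)		/* CPU 1, offset 12: words 17..21 */
			touched[(add_ptr++ + 12) % POOL_WORDS] = 1;
		touched[(add_ptr++ + 0) % POOL_WORDS] = 1;	/* CPU 0: word 10 */

		for (i = 0; i < add_ptr; i++)	/* region already passed over */
			if (!touched[i])
				printf("word %d never mixed\n", i);	/* 5..9 */
		return 0;
	}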

>> How about this instead: drop the hashed offset, and instead let each
>> writer do an atomic_add_return on the index, then iterate over the
>> "reserved" slots. Concurrent additions will at least do non-overlapping
>> writes until the number equals the pool size.

> One atomic operation per byte that we're mixing in? That's quite
> expensive.

Well, for this part, it's one atomic operation per _mix_pool_bytes:
i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

However, I *also* have to use atomic_add to write r->pool[i],
which is indeed expensive. PoC diff attached, but I'm not sure if
you'll like it.
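
The write-back trick in it is nothing deeper than mod-2^32 arithmetic:
if the word currently holds v and the twist wants it to hold w, atomically
adding (w - v) leaves w behind, and a racing writer merely produces some
other (still valid) word.  A quick userspace illustration with C11 atomics:

	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		_Atomic uint32_t slot = 0xdeadbeef;	/* current pool word */
		uint32_t v = atomic_load(&slot);	/* what we read */
		uint32_t w = v ^ 0x12345678;		/* what the twist wants stored */

		atomic_fetch_add(&slot, w - v);		/* v + (w - v) == w mod 2^32 */
		printf("%08x\n", (unsigned)atomic_load(&slot));	/* cc99e897 == w */
		return 0;
	}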

> The input_rotate is useful for the input pool, for things like
> keyboard code so we can make sure they enter the pool at more points
> than just the low bits of each word. But for the output pools, it
> really doesn't make any sense. And we are getting to the point where
> we may end up having different mixing algorithms for the nonblocking
> pool, and in that case I have absolutely no trouble dropping the
> input_rotate part of the mixing algorithm for the non-blocking pool.

You're forgetting: keyboard codes don't go into the input pool.
They go into the per-CPU fast pool, which does some quite thorough
mixing and delivers 128 bits with the entropy effectively distributed
across it.

These days, the only things that go into the large pools are
the output from other pools and /dev/random writes.

But also, that's what the twist_table achieves. By the time the
pool wraps around, the previous value at a given array position
has been read and twisted many times, and is now smeared across
all 32 bits.

If you want larger shifts *and* greater efficiency by removing a table
lookup from the critical path, I'll happily replace it with an Xorshift
structure (will take me a little while; I've generated primitive
polynomials in the past but need to rewrite the code).
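
To be concrete about the shape (not the constants), I mean something like
the fragment below; the (13, 17, 5) shifts are just Marsaglia's classic
full-period triple standing in for whatever the polynomial search produces:

	/* Sketch only: an xorshift-style scramble in place of the
	 * twist_table lookup.  Shift constants are placeholders. */
	static inline __u32 twist_xorshift(__u32 w)
	{
		w ^= w << 13;
		w ^= w >> 17;
		w ^= w << 5;
		return w;
	}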

>> Or a small machine with a couple of concurrent /dev/urandom abusers.
>> Remember, it's globally readable, so it has to be resistant to malicious
>> abuse.

> One of the other ways we could solve this is by hanging a struct off
> the task structure, and if we detect that we have a /dev/urandom
> abuser, we give that process its own urandom pool, so any pissing that
> it does will be in its own pool. (So to speak.)

Let me make sure we're talking about the same thing here.

Segregating abusers so they don't cause *performance* problems
for other threads is a reasonable optimization.

But the quoted conversation was talking about a *security*
problem that could be created by /dev/urandom abusers if some
of your lock-relaxing ideas were followed.

I was observing that the failure case you were hoping would be rare could
be made common by a malicious user.

And for that, I don't think a caching wrapper is the way to solve
the problem.

For security, the underlying pool is either robust or fragile. If it's
robust, we don't need this extra layer; abusers will get shitty
performance but won't hurt the quality of the output.

If we need the layer, then we have to ask if we understand the
vulnerability to abuse and have successfully segregated all damaging
forms of abuse. That's more analysis and design effort than simply
eliminating the problem in the first place.

> Most processes don't actually use that much randomness, and I'm not
> that worried about in-kernel users of the nonblocking pool. Even with
> the most exec-heavy workload, setting up a new exec image is
> heavyweight enough that you're really not going to be contending on
> the lock. I also have trouble with someone spending $$$$ on a system
> with 1K cpu cores and wasting all of their CPU power with running
> shell scripts that fork and exec a lot. :-)

But I just re-read Andi's original trouble report, and although he mentions
AT_RANDOM, it's an application reading from /dev/urandom a lot that's
causing the problem, *not* execve. My memory may have gotten confused.

I can imagine that one /dev/urandom abuser can monopolize the lock and
cause problems for a shell script.

But yes, for this performance problem, a segregate-the-abusers solution
is quite attractive.

> Atomic-add primitives aren't portable either. The representation
> isn't guaranteed to be 32 bits, and on some platforms an atomic int is
> only 24 bits wide (the top 8 bits being used for locking purposes).

No longer true; see Documentation/atomic_ops.txt.

But yes, atomic ops are *ridiculously* expensive on SMP 32-bit SPARC.

Do we care? Is anyone using multiprocessor V7 SPARC in anger these days?

>> There are several possible solutions that don't need separate pools
>> (including separate add-back pools, with a shared seeded pool that
>> is never touched by add-back), so I don't think it's necessary to
>> give up yet.

> Hmm, maybe. I'm a bit worried about the amount of complexity that
> this entails, and the reality is that the urandom pool or pools don't
> provide anything other than cryptographic randomness.

Well, *one* add-back pool would also do, and be a quite simple addition
to the code.

> the urandom pool or pools don't
> provide anything other than cryptographic randomness.

Well, there's cryptographic and there's cryptographic.
/dev/urandom is deliberately *ridiculously* overengineered;
it's not dependent on the security of a single primitive.

I'd rather not downgrade it to depending on the security of AES, or
ChaCha, or Keccak, or any other performance-optimized primitive.

For my own code, sure. But /dev/random is sold as "you *really* don't
have to worry about it".

> At this point, I wonder if it might not be simpler to restrict the
> current nonblocking pool to kernel users, and for userspace users, the
> first time a process reads from /dev/urandom or calls getrandom(2), we
> create for them a ChaCha20 CRNG, which hangs off of the task
> structure. This would require about 72 bytes of state per process,
> but normally very few processes are reading from /dev/urandom or
> calling getrandom(2) from userspace.

Interesting idea, although the loss of backtracking protection for
long-running servers is a significant semantic change.

Remember that making antibacktracking work is the entire problem we're
struggling to solve. If discarding it is on the table, I can solve
the scalability problems extremely easily. Even without giving it up
completely, just put the add-back inside if(trylock()) and the problem
goes away.
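
Concretely, I mean something no bigger than this in extract_buf()
(a sketch: skip the add-back when the lock is contended rather than
serialize every reader on it):

	/* Sketch: opportunistic anti-backtracking add-back.  If another
	 * CPU holds the pool lock, drop the feedback instead of spinning. */
	if (spin_trylock_irqsave(&r->lock, flags)) {
		__mix_pool_bytes(r, hash.w, sizeof(hash.w));
		spin_unlock_irqrestore(&r->lock, flags);
	}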

I'm also not sure where you get 72 bytes from. You need 32 bytes of
key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
ChaCha input block.

Then optionally add 64 bytes of output buffer if you don't want to
discard generated bytes after an unaligned read.

It adds up to 44/48, or 108/112 bytes. Where does 72 come from?
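
Spelled out as a struct (userspace sketch; the field names are mine,
purely to show what I'm counting):

	#include <stdint.h>
	#include <stdio.h>

	struct chacha_crng {		/* illustrative only */
		uint8_t  key[32];	/* 32 bytes */
		uint64_t nonce;		/*  8 bytes */
		uint64_t position;	/*  8 bytes (4 with a 32-bit counter) */
		uint8_t  buf[64];	/* optional unconsumed output block */
	};

	int main(void)
	{
		/* 48 bytes without buf (44 with a 32-bit counter),
		 * 112/108 with it -- nowhere near 72. */
		printf("%zu\n", sizeof(struct chacha_crng));	/* 112 */
		return 0;
	}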

But why bother with even that much? If you're willing to discard
antibacktracking, just have *one* key for the kernel, reseeded more
often, and a per-thread nonce and stream position.

> (Arguably the right answer is to put arc4random in libc, where it can
> automatically handle forks/clones/pthread automatically, but it seems
> pretty clear *that* train has left a long time ago.)

I'm not quite sure which viatical incident you're alluding to, but yes,
that would be the preferred solution.

> I have a feeling this may be less code and complexity, and it nicely
> handles the case where we have a /dev/urandom abuser who feels that
> they want to call /dev/urandom in a tight loop, even on a 4 socket
> Xeon system. :-)

I'm not sanguine about its simplicity; it's a whole other layer.

Might be a good idea anyway. I have a similar patch set in work, but
it's called /dev/frandom because I wasn't willing to suggest such a
semantic change.

But without interfering with legitimate users of /dev/urandom at all,
I'd be quite willing to say that as soon as you read more than 32 bytes
(current request plus recent history) from /dev/urandom, you get a
private ChaCha20 structure to read from.

(This is just enforcing the usage rules from the random(4) man page.
Since I wrote that text, I'm hardly going to object!)
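
In sketch form, the policy I'd be happy with looks roughly like this
(every task_struct field and helper below is hypothetical, named only
to show the shape):

	/* Hypothetical sketch: promote a heavy /dev/urandom reader to a
	 * private per-task ChaCha20 instance.  None of these fields or
	 * helpers exist; they only illustrate the policy. */
	static ssize_t urandom_read(struct file *file, char __user *buf,
				    size_t nbytes, loff_t *ppos)
	{
		struct task_struct *tsk = current;

		tsk->urandom_bytes += nbytes;	/* hypothetical counter */
		if (!tsk->urandom_crng && tsk->urandom_bytes > 32)
			tsk->urandom_crng = alloc_task_crng();	/* hypothetical */

		if (tsk->urandom_crng)
			return task_crng_read(tsk->urandom_crng, buf, nbytes);

		return extract_entropy_user(&nonblocking_pool, buf, nbytes);
	}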



Here's that atomic_t patch I mentioned above.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..e65357c4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
#include <linux/irq.h>
#include <linux/syscalls.h>
#include <linux/completion.h>
+#include <linux/hash.h>

#include <asm/processor.h>
#include <asm/uaccess.h>
@@ -423,7 +424,7 @@ struct entropy_store;
struct entropy_store {
/* read-only data: */
const struct poolinfo *poolinfo;
- __u32 *pool;
+ atomic_t *pool;
const char *name;
struct entropy_store *pull;
struct work_struct push_work;
@@ -431,20 +432,20 @@ struct entropy_store {
/* read-write data: */
unsigned long last_pulled;
spinlock_t lock;
- unsigned short add_ptr;
- unsigned short input_rotate;
+ atomic_t add_ptr;
int entropy_count;
int entropy_total;
unsigned int initialized:1;
unsigned int limit:1;
unsigned int last_data_init:1;
+ unsigned char input_rotate;
__u8 last_data[EXTRACT_SIZE];
};

static void push_to_pool(struct work_struct *work);
-static __u32 input_pool_data[INPUT_POOL_WORDS];
-static __u32 blocking_pool_data[OUTPUT_POOL_WORDS];
-static __u32 nonblocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t input_pool_data[INPUT_POOL_WORDS];
+static atomic_t blocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t nonblocking_pool_data[OUTPUT_POOL_WORDS];

static struct entropy_store input_pool = {
.poolinfo = &poolinfo_table[0],
@@ -488,6 +489,10 @@ static __u32 const twist_table[8] = {
* degree, and then twisted. We twist by three bits at a time because
* it's cheap to do so and helps slightly in the expected case where
* the entropy is concentrated in the low-order bits.
+ *
+ * This function is designed to be safe to call without locking the
+ * entropy_store. Race conditions might do strange things, but will
+ * leave the pool valid and not lose entropy.
*/
static void _mix_pool_bytes(struct entropy_store *r, const void *in,
int nbytes)
@@ -496,7 +501,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
int input_rotate;
int wordmask = r->poolinfo->poolwords - 1;
const char *bytes = in;
- __u32 w;
+
+ BUILD_BUG_ON(sizeof(int) != sizeof(__u32));

tap1 = r->poolinfo->tap1;
tap2 = r->poolinfo->tap2;
@@ -505,23 +511,31 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
tap5 = r->poolinfo->tap5;

input_rotate = r->input_rotate;
- i = r->add_ptr;
+ i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

/* mix one byte at a time to simplify size handling and churn faster */
while (nbytes--) {
- w = rol32(*bytes++, input_rotate);
- i = (i - 1) & wordmask;
+ __u32 v, w = rol32(*bytes++, input_rotate);

/* XOR in the various taps */
- w ^= r->pool[i];
- w ^= r->pool[(i + tap1) & wordmask];
- w ^= r->pool[(i + tap2) & wordmask];
- w ^= r->pool[(i + tap3) & wordmask];
- w ^= r->pool[(i + tap4) & wordmask];
- w ^= r->pool[(i + tap5) & wordmask];
+ i = (i - 1) & wordmask;
+ w ^= atomic_read(r->pool + ((i + tap1) & wordmask));
+ w ^= atomic_read(r->pool + ((i + tap2) & wordmask));
+ w ^= atomic_read(r->pool + ((i + tap3) & wordmask));
+ w ^= atomic_read(r->pool + ((i + tap4) & wordmask));
+ w ^= atomic_read(r->pool + ((i + tap5) & wordmask));
+ w ^= v = atomic_read(r->pool + i);
+ /* Twist the result to mix bits within words */
+ w = (w >> 3) ^ twist_table[w & 7];

- /* Mix the result back in with a twist */
- r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+ /*
+ * Now, to put the result back, we don't have an
+ * atomic_xor operation, but we do have an atomic_add.
+ * Atomically add the value which would cause the desired
+ * change. If there's a race, something strange will happen,
+ * but we won't lose entropy.
+ */
+ atomic_add(w - v, r->pool + i);

/*
* Normally, we add 7 bits of rotation to the pool.
@@ -532,8 +546,13 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
}

- r->input_rotate = input_rotate;
- r->add_ptr = i;
+ /*
+ * Access to input_rotate is not thread-safe, because the
+ * actual value doesn't matter to the security guarantees,
+ * and the fact that it changes is enough for its purpose.
+ * So if there's a race condition, it doesn't matter who wins.
+ */
+ r->input_rotate = (unsigned char)input_rotate;
}

static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1109,17 +1128,26 @@ retry:
* This function does the actual extraction for extract_entropy and
* extract_entropy_user.
*
+ * For the blocking pools, we lock the pool until we feed back the
+ * extracted entropy, so the next extract_buf will return different
+ * results.
+ *
+ * For the non-blocking pool, concurrent access to the pool is allowed,
+ * and the CPU id is used as a salt to ensure that concurrent readers
+ * will get different results.
+ *
* Note: we assume that .poolwords is a multiple of 16 words.
*/
static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
{
int i;
+ bool is_blocking = r->limit;
union {
__u32 w[5];
unsigned long l[LONGS(20)];
} hash;
__u32 workspace[SHA_WORKSPACE_WORDS];
- unsigned long flags;
+ unsigned long flags = flags; /* "initialization" shuts up GCC */

/*
* If we have an architectural hardware random number
@@ -1133,8 +1161,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
hash.l[i] = v;
}

+ if (is_blocking)
+ spin_lock_irqsave(&r->lock, flags);
+ else
+ hash.w[0] ^= hash_32(get_cpu(), 32);
/* Generate a hash across the pool, 16 words (512 bits) at a time */
- spin_lock_irqsave(&r->lock, flags);
for (i = 0; i < r->poolinfo->poolwords; i += 16)
sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);

@@ -1148,7 +1179,10 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
* hash.
*/
__mix_pool_bytes(r, hash.w, sizeof(hash.w));
- spin_unlock_irqrestore(&r->lock, flags);
+ if (is_blocking)
+ spin_unlock_irqrestore(&r->lock, flags);
+ else
+ put_cpu();

memzero_explicit(workspace, sizeof(workspace));
