Re: [PATCH v2] dm crypt: add flags to optionally bypass dm-crypt workqueues

From: Damien Le Moal
Date: Tue Jun 30 2020 - 23:10:43 EST


On 2020/06/30 18:35, Ignat Korchagin wrote:
[...]
>>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>>> index 000ddfab5ba0..6924eb49b1df 100644
>>> --- a/drivers/md/dm-crypt.c
>>> +++ b/drivers/md/dm-crypt.c
>>> @@ -69,6 +69,7 @@ struct dm_crypt_io {
>>> u8 *integrity_metadata;
>>> bool integrity_metadata_from_pool;
>>> struct work_struct work;
>>> + struct tasklet_struct tasklet;
>>>
>>> struct convert_context ctx;
>>>
>>> @@ -127,7 +128,8 @@ struct iv_elephant_private {
>>> * and encrypts / decrypts at the same time.
>>> */
>>> enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
>>> - DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
>>> + DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD,
>>> + DM_CRYPT_NO_READ_WORKQUEUE, DM_CRYPT_NO_WRITE_WORKQUEUE };
>>
>> I liked the "INLINE" naming. What about DM_CRYPT_READ_INLINE and
>> DM_CRYPT_WRITE_INLINE ? Shorter too :)
>>
>> But from the changes below, it looks like your change is now less about being
>> purely inline or synchronous but about bypassing the workqueue.
>> Is this correct ?
>
> Yes, from the test with the NULL cipher it is clearly visible that
> workqueues are the main cause of the performance degradation. The
> previous patch actually did the same thing with the addition of a
> custom xts-proxy synchronous module, which achieved "full inline"
> processing. But it is clear now, that inline/non-inline Crypto API
> does not change much from a performance point of view.

OK. Understood. So the names DM_CRYPT_NO_READ_WORKQUEUE and
DM_CRYPT_NO_WRITE_WORKQUEUE make sense. They are indeed very descriptive.
I was just wondering whether better names could avoid confusion with the
DM_CRYPT_NO_OFFLOAD flag for writes. But I do not have better ideas :)

>
>>>
>>> enum cipher_flags {
>>> CRYPT_MODE_INTEGRITY_AEAD, /* Use authenticated mode for cihper */
>>> @@ -1449,7 +1451,7 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>>> int error);
>>>
>>> static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>> - struct convert_context *ctx)
>>> + struct convert_context *ctx, bool nobacklog)
>>> {
>>> unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
>>>
>>> @@ -1463,12 +1465,12 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>> * requests if driver request queue is full.
>>> */
>>> skcipher_request_set_callback(ctx->r.req,
>>> - CRYPTO_TFM_REQ_MAY_BACKLOG,
>>> + nobacklog ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
>>> kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>
>> Will not specifying CRYPTO_TFM_REQ_MAY_BACKLOG always cause the crypto API to
>> return -EBUSY ? From the comment above the skcipher_request_set_callback(), it
>> seems that this will be the case only if the skcipher driver queue is full. So in
>> other words, keeping the kcryptd_async_done() callback and executing the skcipher
>> request through crypt_convert() and crypt_convert_block_skcipher() may still end
>> up being an asynchronous operation. Can you confirm this and is it what you
>> intended to implement ?
>
> Yes, so far these flags should bypass dm-crypt workqueues only. I had
> a quick look around CRYPTO_TFM_REQ_MAY_BACKLOG and it seems that both
> generic xts as well as aesni implementations (and other crypto
> involved in disk encryption) do not have any logic related to the
> flag, so we may as well leave it as is.

OK. Sounds good. Fewer changes :)

>> From my understanding of the crypto API, and from what Eric commented, a truly
>> synchronous/inline execution of the skcipher needs a call like:
>>
>> crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
>>
>> For the SMR use case, where we must absolutely preserve the write request order, the
>> above change will probably be needed. Will check again.
>
> I think this is not an "inline" execution, rather blocking the current
> thread and waiting for the potential asynchronous crypto thread to
> finish its operation.

Well, if we block waiting for the crypto execution, crypto use becomes "inline"
in the context of the BIO submitter, so the write request order is preserved.
More a serialization than pure inlining, sure. But in the end, exactly what is
needed for zoned block device writes.
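
To make the serialization argument concrete, here is a toy userspace model
(not kernel code). All toy_* names are hypothetical, and the LIFO completion
order of the toy engine is a deliberately pessimistic assumption, not a claim
about how the crypto API behaves. It only shows that waiting after each
submission preserves the issuing order whatever the engine does internally:

```c
#include <stddef.h>

#define TOY_MAX_REQS 8

/*
 * Hypothetical engine: queued requests complete in LIFO order, standing in
 * for "completion order is not guaranteed". Assumption for illustration only.
 */
struct toy_engine {
	int pending[TOY_MAX_REQS];
	size_t npending;
};

/* Queue one request, identified by its sector. Extra requests are dropped. */
static void toy_submit(struct toy_engine *e, int sector)
{
	if (e->npending < TOY_MAX_REQS)
		e->pending[e->npending++] = sector;
}

/*
 * Complete everything currently queued, appending the sectors to out[] in
 * completion order. Returns the updated number of entries in out[].
 */
static size_t toy_wait_all(struct toy_engine *e, int *out, size_t nout)
{
	while (e->npending > 0)
		out[nout++] = e->pending[--e->npending];
	return nout;
}

/* Submit all requests first, then collect completions: order may change. */
static size_t toy_issue_async(const int *sectors, size_t n, int *out)
{
	struct toy_engine e = { .npending = 0 };
	size_t i;

	for (i = 0; i < n; i++)
		toy_submit(&e, sectors[i]);
	return toy_wait_all(&e, out, 0);
}

/* Wait after every submission: completions follow the issuing order. */
static size_t toy_issue_blocking(const int *sectors, size_t n, int *out)
{
	struct toy_engine e = { .npending = 0 };
	size_t i, nout = 0;

	for (i = 0; i < n; i++) {
		toy_submit(&e, sectors[i]);
		nout = toy_wait_all(&e, out, nout);
	}
	return nout;
}
```

With sectors 10, 20, 30 as input, toy_issue_blocking() fills out[] in the same
order, while toy_issue_async() returns them reversed with this toy engine.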

> It seems we have different use-cases here. By bypassing workqueues we
> just want to improve performance, but otherwise do not really care
> about the order of requests.

Yes. Understood. Not using the current workqueue mechanism for writes to zoned
devices is necessary because of write ordering. The performance aspect of that
is the cherry on top of the SMR support cake :)

> Waiting for crypto to complete synchronously may actually decrease
> performance, but is required to preserve the order in some cases. Should
> this be yet another flag?

Yes, blocking may be a performance concern. I will be checking this carefully.
As for another flag, I do not think one is needed:
1) Using bdev_is_zoned(), zoned drives can be trivially identified and
DM_CRYPT_NO_WRITE_WORKQUEUE forced-set.
2) Any other additional change needed for zoned block device write request
handling can be conditional on bdev_is_zoned() && bio_op() == REQ_OP_WRITE.
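
A minimal userspace sketch of the two points above, assuming simplified
stand-ins: the toy_* names and the plain bitmask are hypothetical, and the
is_zoned / is_write booleans stand in for what bdev_is_zoned() and the BIO
operation check would report in the kernel:

```c
#include <stdbool.h>

/* Hypothetical flag bits mirroring the two new dm-crypt flags. */
enum toy_crypt_flag {
	TOY_NO_READ_WORKQUEUE  = 1u << 0,
	TOY_NO_WRITE_WORKQUEUE = 1u << 1,
};

/*
 * Point 1: for a zoned device, force the write workqueue bypass regardless
 * of what the user asked for. Other flags are left untouched.
 */
static unsigned int toy_apply_zoned_policy(unsigned int flags, bool is_zoned)
{
	if (is_zoned)
		flags |= TOY_NO_WRITE_WORKQUEUE;
	return flags;
}

/*
 * Point 2: any extra handling applies only to writes on zoned devices.
 */
static bool toy_zoned_write_path(bool is_zoned, bool is_write)
{
	return is_zoned && is_write;
}
```

The point of the sketch is only that no new user-visible flag is involved:
the decision is derived from the device type and the request direction.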

Currently, for zoned block device writes, I see 2 different approaches I need to
check & test:

1) If the crypto API calls with BACKLOG set preserve request order (creq
completion order is the same as issuing order), then all that is needed is
forcing DM_CRYPT_NO_WRITE_WORKQUEUE to be set for zoned devices.
2) If (1) does not hold, then executing encrypt operations with crypto_wait_req()
should work. Blocking may be an issue with performance though.

Another possible approach may be to use a modified write_tree/write_thread to
include the crypto calls together with backend BIO issuing in sector order. But
that may just add unnecessary context switches.
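
As a rough userspace illustration of that third idea (again a sketch, not
dm-crypt code): the toy_* names are hypothetical, and qsort() on an array
stands in for the rb-tree that write_tree uses to keep pending writes sorted
by sector. The "encrypt" and "issue" steps are placeholders:

```c
#include <stdlib.h>

/* Hypothetical pending write: target sector plus an "encrypted" marker. */
struct toy_write {
	unsigned long long sector;
	int encrypted;
};

/* Order pending writes by ascending sector. */
static int toy_cmp_sector(const void *a, const void *b)
{
	const struct toy_write *wa = a;
	const struct toy_write *wb = b;

	return (wa->sector > wb->sector) - (wa->sector < wb->sector);
}

/*
 * Sort the pending writes, then encrypt and issue each one in sector order
 * from the same context. issued[] records the sectors in issue order.
 */
static void toy_write_thread(struct toy_write *w, size_t n,
			     unsigned long long *issued)
{
	size_t i;

	qsort(w, n, sizeof(*w), toy_cmp_sector);
	for (i = 0; i < n; i++) {
		w[i].encrypted = 1;		/* stand-in for the crypto call */
		issued[i] = w[i].sector;	/* stand-in for submitting the BIO */
	}
}
```

Because encryption and issuing happen in the same loop, the issue order is
the sector order by construction. The cost is the extra hand-off to the
thread doing this work.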

Once you send a v3 of your patch with the BACKLOG fix and other cleanups, I will
rebase my work and try different things.

Thanks !

>
>>> }
>>>
>>> static void crypt_alloc_req_aead(struct crypt_config *cc,
>>> - struct convert_context *ctx)
>>> + struct convert_context *ctx, bool nobacklog)
>>> {
>>> if (!ctx->r.req_aead)
>>> ctx->r.req_aead = mempool_alloc(&cc->req_pool, GFP_NOIO);
>>> @@ -1480,17 +1482,17 @@ static void crypt_alloc_req_aead(struct crypt_config *cc,
>>> * requests if driver request queue is full.
>>> */
>>> aead_request_set_callback(ctx->r.req_aead,
>>> - CRYPTO_TFM_REQ_MAY_BACKLOG,
>>> + nobacklog ? 0 : CRYPTO_TFM_REQ_MAY_BACKLOG,
>>> kcryptd_async_done, dmreq_of_req(cc, ctx->r.req_aead));
>>> }
>>>
>>> static void crypt_alloc_req(struct crypt_config *cc,
>>> - struct convert_context *ctx)
>>> + struct convert_context *ctx, bool nobacklog)
>>> {
>>> if (crypt_integrity_aead(cc))
>>> - crypt_alloc_req_aead(cc, ctx);
>>> + crypt_alloc_req_aead(cc, ctx, nobacklog);
>>> else
>>> - crypt_alloc_req_skcipher(cc, ctx);
>>> + crypt_alloc_req_skcipher(cc, ctx, nobacklog);
>>> }
>>>
>>> static void crypt_free_req_skcipher(struct crypt_config *cc,
>>> @@ -1523,7 +1525,7 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_
>>> * Encrypt / decrypt data from one bio to another one (can be the same one)
>>> */
>>> static blk_status_t crypt_convert(struct crypt_config *cc,
>>> - struct convert_context *ctx)
>>> + struct convert_context *ctx, bool noresched)
>>
>> "noresched" is named after what will happen, not after the reason for it. So to
>> clarify, why not rename this as "convert_inline" or "do_inline" ?
>>
>>> {
>>> unsigned int tag_offset = 0;
>>> unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
>>> @@ -1533,7 +1535,7 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>>>
>>> while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
>>>
>>> - crypt_alloc_req(cc, ctx);
>>> + crypt_alloc_req(cc, ctx, noresched);
>>> atomic_inc(&ctx->cc_pending);
>>>
>>> if (crypt_integrity_aead(cc))
>>> @@ -1566,7 +1568,8 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>>> atomic_dec(&ctx->cc_pending);
>>> ctx->cc_sector += sector_step;
>>> tag_offset++;
>>> - cond_resched();
>>> + if (!noresched)
>>> + cond_resched();
>>> continue;
>>> /*
>>> * There was a data integrity error.
>>> @@ -1879,6 +1882,9 @@ static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io, int async)
>>> unsigned long flags;
>>> sector_t sector;
>>> struct rb_node **rbp, *parent;
>>> + bool nosort =
>>> + (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) ||
>>> + test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags);
>>
>> "nosort" is a little obscure as a name. Why not just "do_inline" ? In any case,
>> since this bool is used only in the if below, you could just write the condition
>> directly there.
>>
>>>
>>> if (unlikely(io->error)) {
>>> crypt_free_buffer_pages(cc, clone);
>>> @@ -1892,7 +1898,7 @@ static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io, int async)
>>>
>>> clone->bi_iter.bi_sector = cc->start + io->sector;
>>>
>>> - if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) {
>>> + if (nosort) {
>>> generic_make_request(clone);
>>> return;
>>> }
>>> @@ -1941,7 +1947,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
>>> sector += bio_sectors(clone);
>>>
>>> crypt_inc_pending(io);
>>> - r = crypt_convert(cc, &io->ctx);
>>> + r = crypt_convert(cc, &io->ctx, test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));
>>> if (r)
>>> io->error = r;
>>> crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
>>> @@ -1971,7 +1977,7 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
>>> crypt_convert_init(cc, &io->ctx, io->base_bio, io->base_bio,
>>> io->sector);
>>>
>>> - r = crypt_convert(cc, &io->ctx);
>>> + r = crypt_convert(cc, &io->ctx, test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags));
>>> if (r)
>>> io->error = r;
>>>
>>> @@ -2031,9 +2037,29 @@ static void kcryptd_crypt(struct work_struct *work)
>>> kcryptd_crypt_write_convert(io);
>>> }
>>>
>>> +static void kcryptd_crypt_tasklet(unsigned long work)
>>> +{
>>> + kcryptd_crypt((struct work_struct *)work);
>>> +}
>>> +
>>> static void kcryptd_queue_crypt(struct dm_crypt_io *io)
>>> {
>>> struct crypt_config *cc = io->cc;
>>> + bool noworkqueue =
>>> + (bio_data_dir(io->base_bio) == READ && test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags)) ||
>>> + (bio_data_dir(io->base_bio) == WRITE && test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));
>>
>> Since this variable is used only in the if statement below, why not use the
>> condition directly in that statement ?
>>
>>> +
>>> + if (noworkqueue) {
>>> + if (in_irq()) {
>>> + /* Crypto API's "skcipher_walk_first() refuses to work in hard IRQ context */
>>> + tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
>>> + tasklet_schedule(&io->tasklet);
>>> + return;
>>> + }
>>> +
>>> + kcryptd_crypt(&io->work);
>>> + return;
>>> + }
>>>
>>> INIT_WORK(&io->work, kcryptd_crypt);
>>> queue_work(cc->crypt_queue, &io->work);
>>> @@ -2838,7 +2864,7 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
>>> struct crypt_config *cc = ti->private;
>>> struct dm_arg_set as;
>>> static const struct dm_arg _args[] = {
>>> - {0, 6, "Invalid number of feature args"},
>>> + {0, 8, "Invalid number of feature args"},
>>> };
>>> unsigned int opt_params, val;
>>> const char *opt_string, *sval;
>>> @@ -2868,6 +2894,10 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
>>>
>>> else if (!strcasecmp(opt_string, "submit_from_crypt_cpus"))
>>> set_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
>>> + else if (!strcasecmp(opt_string, "no_read_workqueue"))
>>> + set_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags);
>>> + else if (!strcasecmp(opt_string, "no_write_workqueue"))
>>> + set_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags);
>>> else if (sscanf(opt_string, "integrity:%u:", &val) == 1) {
>>> if (val == 0 || val > MAX_TAG_SIZE) {
>>> ti->error = "Invalid integrity arguments";
>>> @@ -3196,6 +3226,8 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
>>> num_feature_args += !!ti->num_discard_bios;
>>> num_feature_args += test_bit(DM_CRYPT_SAME_CPU, &cc->flags);
>>> num_feature_args += test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
>>> + num_feature_args += test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags);
>>> + num_feature_args += test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags);
>>> num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
>>> num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags);
>>> if (cc->on_disk_tag_size)
>>> @@ -3208,6 +3240,10 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
>>> DMEMIT(" same_cpu_crypt");
>>> if (test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags))
>>> DMEMIT(" submit_from_crypt_cpus");
>>> + if (test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags))
>>> + DMEMIT(" no_read_workqueue");
>>> + if (test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags))
>>> + DMEMIT(" no_write_workqueue");
>>> if (cc->on_disk_tag_size)
>>> DMEMIT(" integrity:%u:%s", cc->on_disk_tag_size, cc->cipher_auth);
>>> if (cc->sector_size != (1 << SECTOR_SHIFT))
>>> @@ -3320,7 +3356,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
>>>
>>> static struct target_type crypt_target = {
>>> .name = "crypt",
>>> - .version = {1, 21, 0},
>>> + .version = {1, 22, 0},
>>> .module = THIS_MODULE,
>>> .ctr = crypt_ctr,
>>> .dtr = crypt_dtr,
>>>
>>
>>
>> --
>> Damien Le Moal
>> Western Digital Research
>


--
Damien Le Moal
Western Digital Research