Hi Mikulas:
I will send dm-pcache V1 soon; below are my responses to your comments.
Hi

On Thu, 5 Jun 2025, Dongsheng Yang wrote:

Hi Mikulas and all,

This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
Compared with *RFC v1*
<https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.yang@xxxxxxxxx/>
the most important change is that the whole cache has been *ported to the
Device-Mapper framework* and is now exposed as a regular DM target.

Code: https://github.com/DataTravelGuide/linux/tree/dm-pcache

Full RFC v2 test results:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html

All 962 xfstests cases passed successfully under four different pcache
configurations. One of the detailed xfstests runs:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log

Below is a quick tour through the three layers of the implementation,
followed by an example invocation.

----------------------------------------------------------------------
1. pmem access layer
----------------------------------------------------------------------
* All reads use *copy_mc_to_kernel()* so that uncorrectable media errors
  are detected and reported.
* All writes go through *memcpy_flushcache()* to guarantee durability on
  real persistent memory.

You could also try to use normal writes and clflushopt for big writes - I
found out that for larger regions it is better - see the function
memcpy_flushcache_optimized in dm-writecache. Test which way is better.
I did a test with fio on /dev/pmem0, with the attached patch applied to nd_pmem.ko:
When I use a memmap pmem device, I get results similar to the comment in memcpy_flushcache_optimized():
Test (memmap pmem)      clflushopt    flushcache
-------------------------------------------------
test_randwrite_512      200 MiB/s     228 MiB/s
test_randwrite_1024     378 MiB/s     431 MiB/s
test_randwrite_2K       773 MiB/s     769 MiB/s
test_randwrite_4K       1364 MiB/s    1272 MiB/s
test_randwrite_8K       2078 MiB/s    1817 MiB/s
test_randwrite_16K      2745 MiB/s    2098 MiB/s
test_randwrite_32K      3232 MiB/s    2231 MiB/s
test_randwrite_64K      3660 MiB/s    2411 MiB/s
test_randwrite_128K     3922 MiB/s    2513 MiB/s
test_randwrite_1M       3824 MiB/s    2537 MiB/s
test_write_512          228 MiB/s     228 MiB/s
test_write_1024         439 MiB/s     423 MiB/s
test_write_2K           841 MiB/s     800 MiB/s
test_write_4K           1364 MiB/s    1308 MiB/s
test_write_8K           2107 MiB/s    1838 MiB/s
test_write_16K          2752 MiB/s    2166 MiB/s
test_write_32K          3213 MiB/s    2247 MiB/s
test_write_64K          3661 MiB/s    2415 MiB/s
test_write_128K         3902 MiB/s    2514 MiB/s
test_write_1M           3808 MiB/s    2529 MiB/s
But I got a different result when I use Optane pmem100:
Test (Optane pmem100)   clflushopt    flushcache
-------------------------------------------------
test_randwrite_512      167 MiB/s     226 MiB/s
test_randwrite_1024     301 MiB/s     420 MiB/s
test_randwrite_2K       615 MiB/s     639 MiB/s
test_randwrite_4K       967 MiB/s     1024 MiB/s
test_randwrite_8K       1047 MiB/s    1314 MiB/s
test_randwrite_16K      1096 MiB/s    1377 MiB/s
test_randwrite_32K      1155 MiB/s    1382 MiB/s
test_randwrite_64K      1184 MiB/s    1452 MiB/s
test_randwrite_128K     1199 MiB/s    1488 MiB/s
test_randwrite_1M       1178 MiB/s    1499 MiB/s
test_write_512          233 MiB/s     233 MiB/s
test_write_1024         424 MiB/s     391 MiB/s
test_write_2K           706 MiB/s     760 MiB/s
test_write_4K           978 MiB/s     1076 MiB/s
test_write_8K           1059 MiB/s    1296 MiB/s
test_write_16K          1119 MiB/s    1380 MiB/s
test_write_32K          1158 MiB/s    1387 MiB/s
test_write_64K          1184 MiB/s    1448 MiB/s
test_write_128K         1198 MiB/s    1481 MiB/s
test_write_1M           1178 MiB/s    1486 MiB/s
So for now I’d rather keep using memcpy_flushcache() in pcache. In the future, once we’ve come up with a general-purpose optimization, we can switch to that.
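For reference, the dm-writecache approach looks roughly like this; a sketch only (the function name is made up, the 64-byte alignment checks mirror memcpy_flushcache_optimized(), and callers still need the usual store fence before declaring the data durable):

```c
#include <linux/string.h>       /* memcpy_flushcache() */
#ifdef CONFIG_X86
#include <asm/cpufeature.h>     /* static_cpu_has(), boot_cpu_data */
#include <asm/special_insns.h>  /* clflushopt() */
#endif

/*
 * Sketch only, not dm-pcache code: large, cache-line-aligned copies on
 * CPUs with CLFLUSHOPT do a cached memcpy() followed by explicit line
 * flushes; everything else falls back to memcpy_flushcache().
 */
static void copy_to_pmem(void *dest, const void *src, size_t size)
{
#ifdef CONFIG_X86
        if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
            likely(boot_cpu_data.x86_clflush_size == 64) &&
            likely(((unsigned long)dest & 63) == 0) &&
            likely(((unsigned long)src & 63) == 0) &&
            likely((size & 63) == 0)) {
                void *ptr = dest;
                void *end = dest + size;

                memcpy(dest, src, size);
                for (; ptr < end; ptr += 64)
                        clflushopt(ptr);
                return;
        }
#endif
        memcpy_flushcache(dest, src, size);
}
```

If we add something like this later, the conditions and size threshold would have to come from benchmarks like the ones above, since the Optane numbers go the other way.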
----------------------------------------------------------------------
2. cache-logic layer (segments / keys / workers)
----------------------------------------------------------------------
Main features
- 16 MiB pmem segments, log-structured allocation.
- Multi-subtree RB-tree index for high parallelism.
- Optional per-entry *CRC32* on cached data.

Would it be better to use crc32c because it has hardware support in the
SSE4.2 instruction set?
Good suggestion, thanx.
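For example, the switch would look roughly like this (sketch only; the helper name is hypothetical, and the pcache Kconfig entry would also need to select the crc32c library):

```c
#include <linux/crc32c.h>       /* crc32c(), hardware-accelerated via SSE4.2 when available */

/* Sketch only: hypothetical helper for the per-entry data checksum. */
static u32 pcache_data_crc(const void *data, u32 len)
{
        /* the seed value is a format choice */
        return crc32c(0, data, len);
}
```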
- Background *write-back* worker and watermark-driven *GC*.
- Crash-safe replay: key-sets are scanned from *key_tail* on start-up.

Current limitations
- Only *write-back* mode implemented.
- Only FIFO cache invalidate; other (LRU, ARC...) planned.

----------------------------------------------------------------------
3. dm-pcache target integration
----------------------------------------------------------------------
* Table line
  `pcache <pmem_dev> <origin_dev> writeback <true|false>`
* Features advertised to DM:
  - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
    (they force all open key-sets to close and data to be durable).
* Not yet supported:
  - Discard / TRIM.
  - dynamic `dmsetup reload`.

If you don't support it, you should at least try to detect that the user
did reload and return error - so that there won't be data corruption in
this case.
Yes, it actually does return -EOPNOTSUPP:
static int dm_pcache_ctr(struct dm_target *ti, unsigned int argc,
                         char **argv)
{
        struct mapped_device *md = ti->table->md;
        struct dm_pcache *pcache;
        int ret;

        if (md->map) {
                ti->error = "Don't support table loading for live md";
                return -EOPNOTSUPP;
        }
But it would be better to support table reload. You can support it by a similar mechanism to "__handover_exceptions" in the dm-snap.c driver.
Of course, that's planned, but we don't think it's necessary in the initial version.
Runtime controls
- `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.

Status line reports super-block flags, segment counts, GC threshold and
the three tail/head pointers (see the RST document for details).

Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations. GFP_NOWAIT can fail
anytime (for example, if the machine receives too many network packets),
so you must handle the error gracefully. GFP_KERNEL allocation may recurse
back into the I/O path through swapping or file writeback, thus they may
cause deadlocks. You should use GFP_KERNEL in the target constructor or
destructor because there is no I/O to be processed in this time, but they
shouldn't be used in the I/O processing path.
**Yes**, I'm using the approach you described above.
- **GFP_KERNEL** is only used on the constructor path.
- **GFP_NOWAIT** is used in two cases: one for allocating
`cache_key`, and another for allocating `backing_dev_req` on a
cache miss.
Both of these NOWAIT allocations happen while traversing the
`cache_subtree` under the `subtree_lock` spinlock—i.e., in
contexts where scheduling isn't allowed. That’s why we use
`GFP_NOWAIT`.
I see that when you get ENOMEM, you retry the request in 100ms - putting arbitrary waits in the code is generally bad practice - this won't work if the user is swapping to the dm-pcache device. It may be possible that there is no memory free, thus retrying won't help and it will deadlock.
In **V1**, I removed the retry-on-ENOMEM mechanism to make pcache
more coherent. Now it only retries when the cache is actually
full.
- When the cache is full, the `pcache_req` is added to a
`defer_list`.
- When cache invalidation occurs, we kick off all deferred
requests to retry.
This eliminates arbitrary waits.
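A minimal sketch of that defer/kick flow (the names defer_lock, defer_list and pcache_req_submit() are illustrative, not necessarily the exact V1 names):

```c
#include <linux/list.h>
#include <linux/spinlock.h>

/* Sketch only: park a request when the cache is full. */
static void pcache_defer_req(struct dm_pcache *pcache, struct pcache_request *pcache_req)
{
        unsigned long flags;

        spin_lock_irqsave(&pcache->defer_lock, flags);
        list_add_tail(&pcache_req->list_node, &pcache->defer_list);
        spin_unlock_irqrestore(&pcache->defer_lock, flags);
}

/* Sketch only: called after cache invalidation frees segments. */
static void pcache_kick_deferred(struct dm_pcache *pcache)
{
        struct pcache_request *pcache_req, *next;
        unsigned long flags;
        LIST_HEAD(tmp);

        spin_lock_irqsave(&pcache->defer_lock, flags);
        list_splice_init(&pcache->defer_list, &tmp);
        spin_unlock_irqrestore(&pcache->defer_lock, flags);

        list_for_each_entry_safe(pcache_req, next, &tmp, list_node) {
                list_del_init(&pcache_req->list_node);
                pcache_req_submit(pcache_req);  /* may end up deferred again on -EBUSY */
        }
}
```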
I suggest to use mempools to guarantee forward progress in out-of-memory situation. A mempool_alloc(GFP_IO) will never return NULL, it will just wait until some other process frees some entry into the mempool.
Both `cache_key` and `backing_dev_req` now use **mempools**, but—as explained—they may still be allocated under a spinlock, so they use **GFP_NOWAIT**.
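A sketch of what such an allocation looks like (the struct and pool names are illustrative):

```c
#include <linux/mempool.h>

/* Sketch only: must not sleep, may run under subtree_lock. */
static struct pcache_cache_key *cache_key_alloc(struct pcache_cache_tree *cache_tree)
{
        struct pcache_cache_key *key;

        /*
         * GFP_NOWAIT keeps this safe under the spinlock, but it also means
         * mempool_alloc() can return NULL, so the caller has to handle
         * failure (e.g. by deferring the pcache_req and retrying later).
         */
        key = mempool_alloc(&cache_tree->key_pool, GFP_NOWAIT);
        if (!key)
                return NULL;

        memset(key, 0, sizeof(*key));
        return key;
}
```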
Thanks, that's a good suggestion.

Generally, a convention among device mapper targets is that they have a
few fixed parameters first, then the number of optional parameters, and
then the optional parameters themselves (either in "parameter:123" or
"parameter 123" format). You should follow this convention, so that it
can be easily extended with new parameters later.
pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]
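A sketch of how that optional block could be parsed with the dm_arg_set helpers (option names follow the proposed line above; the limits, defaults and error strings are placeholders):

```c
#include <linux/device-mapper.h>
#include <linux/string.h>

/* Sketch only: parse [<number_of_optional_arguments> <optional arguments>...]. */
static int pcache_parse_opt_args(struct dm_target *ti, struct dm_arg_set *as,
                                 bool *writeback, bool *data_crc)
{
        static const struct dm_arg _args[] = {
                { 0, 4, "Invalid number of optional arguments" },
        };
        unsigned int num_args;
        const char *arg;
        int ret;

        *writeback = true;      /* assumed defaults */
        *data_crc = false;

        if (!as->argc)
                return 0;

        ret = dm_read_arg_group(_args, as, &num_args, &ti->error);
        if (ret)
                return ret;

        while (num_args) {
                arg = dm_shift_arg(as);
                num_args--;

                if (!strcasecmp(arg, "cache_mode") && num_args) {
                        arg = dm_shift_arg(as);
                        num_args--;
                        if (strcasecmp(arg, "writeback")) {
                                ti->error = "Only writeback mode is supported";
                                return -EINVAL;
                        }
                } else if (!strcasecmp(arg, "data_crc") && num_args) {
                        arg = dm_shift_arg(as);
                        num_args--;
                        *data_crc = !strcasecmp(arg, "true");
                } else {
                        ti->error = "Unrecognized optional argument";
                        return -EINVAL;
                }
        }

        return 0;
}
```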
The __packed attribute causes performance degradation on RISC machines without hardware support for unaligned accesses - the compiler will generate byte-by-byte accesses - I suggest not using it and instead making sure that the members in the structures are naturally aligned (inserting explicit padding if needed).
Thanks — we removed `__packed` in V1.
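For illustration, this is the kind of layout it implies for an on-pmem structure (the members here are made up, not the real pcache metadata):

```c
#include <linux/build_bug.h>
#include <linux/types.h>

/*
 * Sketch only: every member sits at its natural alignment and the padding
 * is explicit, so the layout is identical on all architectures and no
 * byte-by-byte access code is generated.
 */
struct pcache_meta_example {
        __u64   seq;            /* offset  0 */
        __u32   data_crc;       /* offset  8 */
        __u16   flags;          /* offset 12 */
        __u16   __pad;          /* offset 14, explicit padding */
        __u64   data_off;       /* offset 16, stays 8-byte aligned */
};
static_assert(sizeof(struct pcache_meta_example) == 24);
```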
The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is optimized for 4, 8 and 16-byte accesses (because that's what dm-writecache uses) - I suggest adding more optimizations to it for constant sizes that fit the usage pattern of dm-pcache.
I see that you are using "queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, delay);" - the problem here is that if the entry is already queued with a delay and you attempt to queue it again with zero again, this new queue attempt will be ignored - I'm not sure if this is intended behavior or not.
Yes, in simple terms:
1. If after writeback completes the cache is clean, we should
delay.
2. If it's not clean, set delay = 0 so the next writeback cycle
can begin immediately.
In theory these are two separate branches; even if the zero-delay queueing
is ignored because a delayed instance is already pending, writeback work
isn't highly time-sensitive, so I believe this is harmless.
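For reference, if the zero-delay branch ever needs to override an already-queued delayed run, mod_delayed_work() could be used; a sketch (the clean check and the interval constant are placeholders, the workqueue accessors are the existing ones):

```c
#include <linux/workqueue.h>

/* Sketch only: requeue policy after one writeback pass. */
static void cache_writeback_requeue(struct pcache_cache *cache, bool cache_clean)
{
        unsigned long delay = cache_clean ? CACHE_WRITEBACK_INTERVAL : 0;

        /*
         * queue_delayed_work() is a no-op when the work is already pending,
         * so a later delay=0 request cannot shorten an earlier delayed one.
         * mod_delayed_work() modifies the timer of a pending work item
         * instead, if that behaviour is ever wanted.
         */
        mod_delayed_work(cache_get_wq(cache), &cache->writeback_work, delay);
}
```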
req_complete_fn: this will never run with interrupts disabled, so you can replace spin_lock_irqsave/spin_unlock_irqrestore with spin_lock_irq/spin_unlock_irq.
Thanx, fixed in V1.
backing_dev_bio_end: there's a bug in this function - it may be called both with interrupts disabled and interrupts enabled, so you should not use spin_lock/spin_unlock. You should call spin_lock_irqsave/spin_unlock_irqrestore.
Good catch, it's fixed in V1.
queue_work(BACKING_DEV_TO_PCACHE - I would move it inside the spinlock - see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar problem.
Thanx, fixed in V1.
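A sketch of the completion path with both fixes applied (the fields on the backing_dev structures are illustrative, not the exact V1 layout):

```c
#include <linux/bio.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Sketch only: bio end_io callbacks can run with interrupts enabled or disabled. */
static void backing_dev_bio_end(struct bio *bio)
{
        struct pcache_backing_dev_req *backing_req = bio->bi_private;
        struct pcache_backing_dev *backing_dev = backing_req->backing_dev;
        unsigned long flags;

        backing_req->ret = blk_status_to_errno(bio->bi_status);

        spin_lock_irqsave(&backing_dev->lock, flags);
        list_add_tail(&backing_req->node, &backing_dev->complete_list);
        /* queue the completion work before dropping the lock */
        queue_work(BACKING_DEV_TO_PCACHE(backing_dev)->task_wq,
                   &backing_dev->req_complete_work);
        spin_unlock_irqrestore(&backing_dev->lock, flags);
}
```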
bio_map - bio vectors can hold arbitrarily long entries - if the "base" variable is not from vmalloc, you can just add it as one bvec entry.
Good idea - fixed in V1: we introduced `inline_bvecs`, and when `base`
isn't from vmalloc, we use a single bvec entry.
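A sketch of the idea (simplified; it assumes the bio already has enough bvecs, e.g. the `inline_bvecs` mentioned above):

```c
#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Sketch only: map a kernel buffer of @len bytes into @bio. */
static void bio_map_kmem(struct bio *bio, void *base, size_t len)
{
        if (!is_vmalloc_addr(base)) {
                /*
                 * kmalloc'd memory is physically contiguous, so the whole
                 * range fits in a single (multi-page) bvec.
                 */
                bio_add_page(bio, virt_to_page(base), len, offset_in_page(base));
                return;
        }

        /* vmalloc'd memory is only virtually contiguous: add it page by page */
        while (len) {
                unsigned int step = min_t(size_t, PAGE_SIZE - offset_in_page(base), len);

                bio_add_page(bio, vmalloc_to_page(base), step, offset_in_page(base));
                base += step;
                len -= step;
        }
}
```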
"backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of kcalloc, there's no need to zero the value.
+	if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
+		return NULL;
+
+	udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
+	goto again;

It is not good practice to insert arbitrary waits (here, the wait is
burning CPU power, which makes it even worse). You should add the process
to a wait queue and wake up the queue. See the functions
writecache_wait_on_freelist and writecache_free_entry for an example of
how to wait correctly.
`get_cache_segment()` may be called under a spinlock, so I originally
designed a short busy-wait for a free segment: if wait_count >=
PCACHE_WAIT_NEW_CACHE_COUNT, we return -EBUSY and let the dm-pcache
defer_list retry the request.
However, I later discovered that when the cache is full, there's rarely enough time during that brief short-busy-wait to free a segment.
Therefore in V1 I removed it and instead return `-EBUSY`, allowing dm-pcache’s retry mechanism to wait for a cache invalidation before retrying.
+static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
+{
+	struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
+	struct dm_pcache *pcache = ti->private;
+	int ret;
+
+	pcache_req->pcache = pcache;
+	kref_init(&pcache_req->ref);
+	pcache_req->ret = 0;
+	pcache_req->bio = bio;
+	pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	pcache_req->data_len = bio->bi_iter.bi_size;
+	INIT_LIST_HEAD(&pcache_req->list_node);
+	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);

This looks suspicious because you store the original bi_sector to
pcache_req->off and then subtract the target offset from it. Shouldn't
"bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
be before "pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;"?
Yes, that logic is indeed questionable, but it works in testing.
Since we define dm-pcache as a **singleton**, both behaviors should effectively be equivalent, IIUC. Also, in V1 I moved the call to `dm_target_offset()` so it runs before setting up `pcache_req->off`, making the code logic correct.
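i.e., in V1 the relevant lines are ordered like this (ordering sketch only):

```c
	/* remap to the target-relative sector first ... */
	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);

	/* ... then derive the request offset from the remapped sector */
	pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
	pcache_req->data_len = bio->bi_iter.bi_size;
```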
Generally, the code doesn't seem bad. After reworking the out-of-memory
handling and replacing arbitrary waits with wait queues, I can merge it.

Mikulas

Thanx
Dongsheng

From 88476f95801f7177c3a8c30c663cdd788f73a3b5 Mon Sep 17 00:00:00 2001