Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel

From: thornber
Date: Thu Jan 17 2013 - 08:26:44 EST


On Thu, Jan 17, 2013 at 05:52:00PM +0800, Amit Kale wrote:
> Hi Joe, Kent,
>
> [Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]
>
> My understanding is that these three caching solutions all have three principal blocks.

Let me try and explain how dm-cache works.

> 1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.

Of course we have this, but it's part of the policy plug-in. I've
done this because the policy nearly always needs to do some book
keeping (eg, update a hit count when accessed).

> 2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.

I think there's more than just this. These are the tasks that I hand
over to the policy:

a) _Which_ blocks should be promoted to the cache? This seems to be
the key decision in terms of performance. Blindly trying to
promote every io or even just every write will lead to some very
bad performance in certain situations.

The mq policy uses a multiqueue (effectively a partially sorted
lru list) to keep track of candidate block hit counts. When
candidates get enough hits they're promoted. The promotion
threshold is periodically recalculated by looking at the hit
counts for the blocks already in the cache.

The hit counts should degrade over time (for some definition of
time; eg. io volume). I've experimented with this, but not yet
come up with a satisfactory method.

I read through EnhanceIO yesterday, and think this is where
you're lacking.

b) When should a block be promoted? If you're swamped with io, then
adding copy io is probably not a good idea. Current dm-cache
just has a configurable threshold for the promotion/demotion io
volume. If you or Kent have some ideas for how to approximate
the bandwidth of the devices I'd really like to hear about it.

c) Which blocks should be demoted?

This is the bit that people commonly think of when they say
'caching algorithm'. Examples are lru, arc, etc. Such
descriptions are fine when describing a cache where elements
_have_ to be promoted before they can be accessed, for example a
cpu memory cache. But we should be aware that 'lru' for example
really doesn't tell us much in the context of our policies.

The mq policy uses a blend of lru and lfu for eviction; it seems
to work well.
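
To make decisions (a), (b) and (c) concrete, here is a minimal Python sketch. It is not the mq policy's code: the class name `ToyPolicy`, the plain dict of hit counts, the mean-based threshold, the "older half" eviction rule and the `migration_budget` knob are all assumptions made for illustration. It also ignores writeback of dirty blocks on demotion.

```python
from collections import OrderedDict


class ToyPolicy:
    """Illustrative only: hit-count promotion with an io-volume throttle."""

    def __init__(self, cache_blocks, migration_budget):
        self.cache_blocks = cache_blocks          # cache capacity, in blocks
        self.migration_budget = migration_budget  # max copies in flight (assumed knob)
        self.in_flight = 0                        # copies currently running
        self.candidates = {}                      # uncached block -> hit count
        self.cached = OrderedDict()               # cached block -> hit count, oldest first
        self.threshold = 2                        # hits needed before promotion

    def recalc_threshold(self):
        # (a) derive the threshold from hit counts already in the cache.
        # The mean is an arbitrary stand-in, not what mq computes.
        if self.cached:
            mean = sum(self.cached.values()) // len(self.cached)
            self.threshold = max(2, mean)

    def pick_victim(self):
        # (c) blend of lru and lfu: among the older half of the cache
        # (lru order), evict the block with the fewest hits (lfu).
        oldest = list(self.cached.items())[: max(1, len(self.cached) // 2)]
        return min(oldest, key=lambda kv: kv[1])[0]

    def map(self, block):
        """Return 'hit', 'miss' or 'promote' for an io to `block`."""
        if block in self.cached:
            self.cached[block] += 1
            self.cached.move_to_end(block)        # most recently used goes last
            return "hit"
        hits = self.candidates.get(block, 0) + 1
        self.candidates[block] = hits
        if hits < self.threshold:
            return "miss"                         # (a) not hot enough yet
        if self.in_flight >= self.migration_budget:
            return "miss"                         # (b) too much copy io already
        if len(self.cached) >= self.cache_blocks:
            del self.cached[self.pick_victim()]   # (c) demote to make room
        del self.candidates[block]
        self.cached[block] = hits
        self.in_flight += 1
        return "promote"

    def migration_done(self):
        # Called when a promotion copy completes.
        self.in_flight -= 1
```

The point of the sketch is that lookup, promotion and eviction share the same book keeping, which is why all three live behind one policy interface.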

A couple of other things I should mention: dm-cache uses a large block
size compared to eio, eg, 64k - 1m. This is a mixed blessing:

- our copy io is more efficient (we don't have to worry about
batching migrations together so much, which is something eio is
careful to do).

- we have fewer blocks to hold stats about, so can keep more info per
block in the same amount of memory.

- We trigger more copying. For example, if an incoming write triggers
a promotion from the origin to the cache, and the io covers a whole
block, we can avoid any copy from the origin to the cache. With a
bigger block size this optimisation happens less frequently.

- We waste SSD space. eg, a 4k hotspot could trigger a whole block
to be moved to the cache.
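
The trade-offs in the list above come down to simple arithmetic. The Python sketch below works through them; the 100 GiB cache size and the helper names are assumptions chosen for illustration, not dm-cache values.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def blocks_in_cache(cache_bytes, block_bytes):
    """Number of cache blocks the policy must keep stats for."""
    return cache_bytes // block_bytes


def wasted_bytes(hotspot_bytes, block_bytes):
    """SSD space occupied but not hot when a small hotspot pulls in whole blocks."""
    blocks = -(-hotspot_bytes // block_bytes)     # ceiling division
    return blocks * block_bytes - hotspot_bytes


def write_covers_block(offset, length, block_bytes):
    """True if a write overwrites at least one complete block, so that
    block could be promoted without first copying it from the origin."""
    first_full = -(-offset // block_bytes)        # first block starting at/after offset
    end_full = (offset + length) // block_bytes   # blocks ending at/before the write's end
    return end_full > first_full


# Compare a small block size with the two ends of the 64k - 1m range.
for bs in (4 * KIB, 64 * KIB, MIB):
    print(bs // KIB, "KiB blocks:",
          blocks_in_cache(100 * GIB, bs), "entries,",
          wasted_bytes(4 * KIB, bs) // KIB, "KiB wasted by a 4 KiB hotspot,",
          "64 KiB aligned write skips copy:",
          write_covers_block(0, 64 * KIB, bs))
```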


We do not keep the dirty state of cache blocks up to date on the
metadata device. Instead we have a 'mounted' flag that's set in the
metadata when opened. When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted flag
cleared. On a crash the mounted flag will still be set on reopen, so
all blocks are conservatively treated as 'dirty'. Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
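
The mounted-flag scheme can be sketched in a few lines of Python. This is only a model of the behaviour described above: `ToyMetadata` and its dict standing in for the metadata device are hypothetical, and none of the names come from the dm-cache source.

```python
class ToyMetadata:
    """Stands in for the on-disk metadata; a dict plays the role of the device."""

    def __init__(self, nr_blocks):
        self.disk = {"mounted": False, "dirty": [False] * nr_blocks}

    def open(self):
        """Return the in-core dirty bits to start from."""
        if self.disk["mounted"]:
            # Flag left set: no clean shutdown happened, so the saved
            # bits cannot be trusted. Treat every block as dirty.
            dirty = [True] * len(self.disk["dirty"])
        else:
            dirty = list(self.disk["dirty"])
        self.disk["mounted"] = True
        return dirty

    def clean_shutdown(self, dirty):
        # Only now are the dirty bits written out and the flag cleared.
        self.disk["dirty"] = list(dirty)
        self.disk["mounted"] = False
```

The cost of a crash in this model is extra writeback after restart, in exchange for not touching the metadata on every write.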

I really view dm-cache as a slow moving hotspot optimiser. Whereas I
think eio and bcache are much more of a hierarchical storage approach,
where writes go through the cache if possible?

> 3. IO handling - This is about issuing IO requests to SSD and HDD.

I get most of this for free via dm and kcopyd. I'm really keen to
see how bcache does; it's more invasive of the block layer, so I'm
expecting it to show far better performance than dm-cache.

> 4. Dirty data clean-up algorithm (for write-back only) - The dirty
> data clean-up algorithm decides when to write a dirty block in an
> SSD to its original location on HDD and executes the copy.

Yep.

> When comparing the three solutions we need to consider these aspects.

> 1. User interface - This consists of commands used by users for
> creating, deleting, editing properties and recovering from error
> conditions.

I was impressed how easy eio was to use yesterday when I was playing
with it. Well done.

Driving dm-cache through dmsetup isn't much more of a hassle,
though. We've decided to pass policy specific params on the
target line, and tweak via a dm message (again simple via dmsetup).
I don't think this is as simple as exposing them through something
like sysfs, but it is more in keeping with the device-mapper way.

> 2. Software interface - Where it interfaces to Linux kernel and applications.

See above.

> 3. Availability - What's the downtime when adding, deleting caches,
> making changes to cache configuration, conversion between cache
> modes, recovering after a crash, recovering from an error condition.

Normal dm suspend, alter table, resume cycle. The LVM tools do this
all the time.

> 4. Security - Security holes, if any.

Well I saw the comment in your code describing the security flaw you
think you've got. I hope we don't have any, I'd like to understand
your case more.

> 5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.

I think we all work with any block device. But eio and bcache can
overlay any device node, not just a dm one. As mentioned in earlier
email I really think this is a dm issue, not specific to dm-cache.

> 6. Persistence of cache configuration - Once created does the cache
> configuration stay persistent across reboots. How are changes in
> device sequence or numbering handled.

We've gone for no persistence of policy parameters. Instead
everything is handed into the kernel when the target is set up. This
decision was made by the LVM team who wanted to store this
information themselves (we certainly shouldn't store it in two
places at once). I don't feel strongly either way, and could
persist the policy params v. easily (eg, 1 day's work).

One thing I do provide is a 'hint' array for the policy to use and
persist. The policy specifies how much data it would like to store
per cache block, and then writes it on clean shutdown (hence 'hint',
it has to cope without this, possibly with temporarily degraded
performance). The mq policy uses the hints to store hit counts.
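
A rough Python model of the hint array follows. The 4-byte hint size, the `HintStore` class and the helper names are assumptions for illustration; the only behaviour taken from the description above is that hints are fixed-size per cache block, written on clean shutdown, and optional on load.

```python
class HintStore:
    """Toy persistent hint area: a fixed number of bytes per cache block."""

    def __init__(self, nr_blocks, hint_size):
        self.nr_blocks = nr_blocks
        self.hint_size = hint_size   # requested by the policy
        self.saved = None            # None models 'no valid hints on disk'

    def save(self, hints):
        """Called on clean shutdown only."""
        if len(hints) != self.nr_blocks:
            raise ValueError("one hint per cache block expected")
        if any(len(h) != self.hint_size for h in hints):
            raise ValueError("hint has the wrong size")
        self.saved = list(hints)

    def invalidate(self):
        """Models a crash: whatever was saved can no longer be relied on."""
        self.saved = None

    def load(self):
        return self.saved


def pack_hit_count(count, hint_size=4):
    """Encode a hit count into a fixed-size hint, saturating on overflow."""
    limit = 2 ** (8 * hint_size) - 1
    return min(count, limit).to_bytes(hint_size, "little")


def initial_hit_counts(store):
    """The policy must cope without hints; it just starts cold."""
    hints = store.load()
    if hints is None:
        return [0] * store.nr_blocks
    return [int.from_bytes(h, "little") for h in hints]
```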

> 7. Persistence of cached data - Does cached data remain across
> reboots/crashes/intermittent failures. Is the "sticky"ness of data
> configurable.

Surely this is a given? A cache would be trivial to write if it
didn't need to be crash proof.

> 8. SSD life - Projected SSD life. Does the caching solution cause
> too much of write amplification leading to an early SSD failure.

No, I decided years ago that life was too short to start optimising
for specific block devices. By the time you get it right the
hardware characteristics will have moved on. Doesn't the firmware
on SSDs try and even out io wear these days?

That said I think we evenly use the SSD. Except for the superblock
on the metadata device.

> 9. Performance - Throughput is generally most important. Latency is
> also one more performance comparison point. Performance under
> different load classes can be measured.

I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot when
we're doing large linear ios and stops hit counting; best leave this
stuff on the spindle.
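
One simple way to spot large linear ios is to track whether each io starts where the previous one ended. The Python sketch below does that; the class name, the threshold of 4 and the single-stream tracking are assumptions, not a description of how the mq policy implements it.

```python
class SequentialDetector:
    """Illustrative only: flags runs of contiguous ios as linear."""

    def __init__(self, threshold=4):
        self.threshold = threshold   # contiguous ios before we call it linear (assumed)
        self.next_sector = None      # where a contiguous follow-up io would start
        self.run = 0                 # length of the current contiguous run

    def observe(self, sector, nr_sectors):
        """Record an io; return True if it should still be hit counted."""
        if sector == self.next_sector:
            self.run += 1
        else:
            self.run = 0             # a seek breaks the run
        self.next_sector = sector + nr_sectors
        return self.run < self.threshold
```

While `observe()` returns False the policy would skip hit counting, so a long streaming read never becomes a promotion candidate and stays on the spindle.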

> 10. ACID properties - Atomicity, Concurrency, Idempotent,
> Durability. Does the caching solution have these typical
> transactional database or filesystem properties. This includes
> avoiding torn-page problem amongst crash and failure scenarios.

Could you expand on the torn-page issue please?

> 11. Error conditions - Handling power failures, intermittent and permanent device failures.

I think the area where dm-cache is currently lacking is intermittent
failures. For example if a cache read fails we just pass that error
up, whereas eio sees if the block is clean and if so tries to read
off the origin. I'm not sure which behaviour is correct; I like to
know about disk failure early.
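
The two behaviours being compared can be put side by side in a small Python sketch. Everything here is hypothetical: `read_block`, the `ReadError` exception and the `fallback_to_origin` switch are illustrative names, with the devices modelled as plain callables.

```python
class ReadError(Exception):
    """Raised by a device callable when a read fails."""


def read_block(block, cache, origin, dirty, fallback_to_origin):
    """Read a cached block.

    cache and origin are callables taking a block number and returning
    bytes or raising ReadError. dirty is the set of dirty block numbers.
    """
    try:
        return cache(block)
    except ReadError:
        # A clean block has an identical copy on the origin, so it can be
        # served from there. A dirty block cannot: the origin copy is stale.
        if fallback_to_origin and block not in dirty:
            return origin(block)
        raise   # pass the error up so the failure is visible early
```

With `fallback_to_origin=False` this matches the "pass that error up" behaviour; with it set to True, it matches the retry-from-origin behaviour attributed to eio above.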

> 12. Configuration parameters for tuning according to applications.

Discussed above.

> We'll soon document EnhanceIO behavior in context of these
> aspects. We'll appreciate if dm-cache and bcache is also documented.

I hope the above helps. Please ask away if you're unsure about
something.

> When comparing performance there are three levels at which it can be measured

Developing these caches is tedious. Test runs take time, and really
slow the dev cycle down. So I suspect we've all been using
microbenchmarks that run in a few minutes.

Let's get our pool of microbenchmarks together, then work on some
application level ones (we're happy to put some time into developing
these).

- Joe