Re: [RFC][PATCH] bcache: cache a block device with an ssd

From: Jeff Moyer
Date: Tue Apr 13 2010 - 11:47:42 EST


Kent Overstreet <kent.overstreet@xxxxxxxxx> writes:

> Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
> nice if you could use them as cache...
>
> Thus bcache (imaginative name, eh?). It's designed around the
> performance characteristics of SSDs - it only allocates in erase block
> sized buckets, and it uses a bare minimum btree to track cached extents
> (which can be anywhere from a single sector to the bucket size). It's
> also designed to be very lazy, and use garbage collection to clean stale
> pointers.
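>
> Concretely, a cached extent is a single key/pointer pair (struct btree_key
> below); as a rough worked example with the packing used below, a 16 sector
> extent at sector 1000 of device 3, cached at sector 2048 while that
> bucket's generation is 5, would be stored as:
>
>         struct btree_key k = {
>                 .key = TREE_KEY(3, 1000),        /* 8 bit device id | 56 bit sector offset */
>                 .ptr = TREE_PTR(5, 16, 2048ULL), /* 8 bit generation | 16 bit length | 40 bit offset */
>         };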
>
> The code is very rough and there are a number of things missing before it
> can actually be useful (like garbage collection), but I've got it
> working with an ext4 filesystem on top and it does successfully cache.
> It doesn't yet even look at writes though, so read/write access would
> quickly explode.
>
> A goal of mine was that it be possible to add a cache to and remove it from
> an existing block device at runtime; this should make it easier and more
> practical to use than were it a stacking block device driver.

I see this has languished without response. This is an interesting
idea; however, the first question that comes to mind is: have you looked
at fs-cache? Does it come anywhere close to suiting your needs?

Cheers,
Jeff

(the rest of the message is left intact for David's perusal)

> To that end, I'm hooking in to __generic_make_request, right before it
> passes bios to the elevator. This has a number of implications (not
> being able to wait on IO makes things tricky), but it seemed like the
> least bad way to do it, at least for now. As far as I can tell the only
> behavior that changes is that trace_block_remap gets run again if the
> make_request_fn returns nonzero (in that case I just resubmit it via
> generic_make_request; if part of the btree wasn't in the page cache I
> can't simply return it to __generic_make_request). This obviously
> requires more code in __generic_make_request - which I was hoping to
> avoid or at least make generic, but hopefully one more conditional won't
> get me tarred and feathered... or someone will have a better idea.
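>
> (Concretely, bd_cache_fn is a new function pointer on struct block_device,
> and the check in __generic_make_request - see the blk-core.c hunk at the
> bottom of the patch - boils down to:
>
>         if (bio->bi_bdev->bd_cache_fn)
>                 ret = bio->bi_bdev->bd_cache_fn(q, bio); /* 0: bcache consumed/remapped the bio */
>         if (ret)
>                 ret = q->make_request_fn(q, bio);        /* nonzero: fall through to the elevator */
>
> so the only contract is the return value.)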
>
> (There are definitely races here that I'm not pretending to deal with
> yet. I don't think it'll be too bad, though. (There's a lot of locking
> that needs to happen and isn't yet...))
>
> As previously implied, cache hits are tracked on a per bucket basis.
> Each bucket has a 16 bit priority, and I maintain a heap of all the
> buckets by priority. Each bucket also has an 8 bit generation; each
> pointer contains the generation of the bucket it points into. If they
> don't match, it's a stale pointer. There's a small fifo of free buckets
> kept in memory; to refill the fifo, it grabs the bucket with the
> smallest priority, increments the generation, issues a BLK_DISCARD
> request and sticks it on the end of the fifo. We just have to make sure
> we garbage collect the entire btree every so often - that code isn't
> written yet, but it just requires a second array (last_gc_generation of
> every bucket), a heap of generation - last_gc_generation, and then you
> know when you have to gc.
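>
> A rough sketch of the staleness check and the gc trigger (the 128 here is
> a made-up threshold; the real code would consult the heap of
> generation - last_gc_generation):
>
>         /* A pointer is stale once its bucket has been reused: */
>         static inline int ptr_stale(struct bucket *b, uint8_t ptr_gen)
>         {
>                 return b->generation != ptr_gen;
>         }
>
>         /* Generations are only 8 bits, so the btree must be garbage collected
>          * before any bucket's generation laps a pointer still in the tree: */
>         static inline int gc_needed(uint8_t generation, uint8_t last_gc_generation)
>         {
>                 return (uint8_t) (generation - last_gc_generation) > 128;
>         }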
>
> Btree buckets are only completely sorted after they've been garbage
> collected; other times, there's multiple sorted sets. When we go to
> insert a key, we look for a page that is dirty and not full, and insert
> it there sorted into the appropriate location. We write only when a page
> fills up, so the SSD doesn't have to do the erase/rewrite thing. Headers
> contain a 64 bit random number; when we're looking for an open page in a
> btree bucket, it has to match the first page's number or else it's an
> empty page.
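>
> So deciding whether a page is in use is just a header comparison - roughly,
> using the btree_node_header defined below (hypothetical helper name):
>
>         static int page_is_open(void *first_page, void *page)
>         {
>                 struct btree_node_header *a = first_page, *b = page;
>
>                 /* A page that doesn't carry the bucket's random number hasn't
>                  * been written since the bucket was allocated, so it's free
>                  * to start a new sorted set in. */
>                 return a->random != b->random;
>         }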
>
> Devices we cache are referred to by an 8 bit integer, within the btree.
> When a device is registered, a UUID is passed in too which is stored in
> the superblock. That's done via a sysfs file, currently in
> /sys/kernel/bcache (didn't really seem the place, but I couldn't decide
> on anything better). I.e., you register a backing device with 'echo
> "<uuid> /dev/foo > /sys/kernel/bcache/register_dev', and a cache device
> with 'echo /dev/bar > /sys/kernel/bcache/register_cache'. It keeps some
> statistics there too, and I plan to flesh that out soon.
>
> On the list are journalling and write behind caching - with multiple
> cache devices, it'll be straightforward to mirror writes and drop one
> set when it's written to the backing device.
>
> Also on the list - right now I insert items into the cache whenever
> they're not found, by saving and replacing bio->bi_end_io, inserting
> once the read returns and not returning it as finished until my writes
> finish. This has obvious drawbacks, but I don't know under what if any
> circumstances I could use those pages after the bio completes. It seems
> to me the VM would be in a much better position to know what ought to be
> cached, but for just getting something working this seemed easiest.
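>
> The mechanics are just wrapping the completion - a simplified sketch, with
> insert_into_cache standing in for the real bio_insert path and no error
> handling:
>
>         static void cache_miss_endio(struct bio *bio, int error)
>         {
>                 struct search_context *s = bio->bi_private;
>
>                 /* The read from the backing device finished; restore the
>                  * original completion... */
>                 bio->bi_end_io = s->end_io;
>                 bio->bi_private = s->q;
>
>                 /* ...but don't call it until the writes to the SSD complete,
>                  * so the pages stay valid long enough to be copied out. */
>                 insert_into_cache(bio, s);
>         }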
>
> That's about all I can think of for now; I look forward to any
> commentary/questions/advice. This is the first kernel programming I've
> gotten around to doing, I'm fairly happy with how far it's come in a
> month but there are undeniably parts that are painful to look at...
>
> At the bottom is nearly the shortest possible program to initialize a
> cache device.
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 62a5921..19529ad 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -99,6 +99,10 @@ config DEBUG_BLK_CGROUP
> in the blk group which can be used by cfq for tracing various
> group related activity.
>
> +config BLK_CACHE
> + tristate "Block device as cache"
> + default m
> +
> endif # BLOCK
>
> config BLOCK_COMPAT
> diff --git a/block/Makefile b/block/Makefile
> index cb2d515..e9b5fc0 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -15,3 +15,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
>
> obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
> obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
> +
> +obj-$(CONFIG_BLK_CACHE) += bcache.o
> diff --git a/block/bcache.c b/block/bcache.c
> new file mode 100644
> index 0000000..47c2bc4
> --- /dev/null
> +++ b/block/bcache.c
> @@ -0,0 +1,1387 @@
> +#include <linux/blkdev.h>
> +#include <linux/buffer_head.h>
> +#include <linux/init.h>
> +#include <linux/kobject.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/page-flags.h>
> +#include <linux/random.h>
> +#include <linux/sort.h>
> +#include <linux/string.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>
> +#include <linux/workqueue.h>
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Kent Overstreet <kent.overstreet@xxxxxxxxx>");
> +
> +/*
> + * Page 0: unused
> + * Page 1: superblock
> + * Page 2: device UUIDs
> + * Page 3+: bucket priorities
> + *
> + */
> +
> +struct cache_sb {
> + uint8_t magic[16];
> + uint32_t version;
> + uint16_t block_size; /* sectors */
> + uint16_t bucket_size; /* sectors */
> + uint32_t journal_start; /* buckets */
> + uint32_t first_bucket; /* start of data */
> + uint64_t nbuckets; /* device size */
> + uint64_t first_free_bucket; /* buckets that have never been used, only increments */
> + uint64_t btree_root;
> + uint16_t btree_level;
> +};
> +
> +struct bucket {
> + uint32_t heap;
> + uint16_t priority;
> + uint8_t generation;
> +};
> +
> +struct bucket_disk {
> + uint16_t priority;
> + uint8_t generation;
> +};
> +
> +struct btree_node_header {
> + uint32_t csum;
> + uint32_t nkeys;
> + uint64_t random;
> +};
> +
> +struct btree_key {
> + uint64_t key;
> + uint64_t ptr;
> +};
> +
> +struct cache_device {
> + struct cache_sb sb;
> + struct kobject kobj;
> + struct list_head list;
> + struct block_device *bdev;
> + struct module *owner;
> +
> + long heap_size;
> + long *heap;
> + struct bucket *buckets;
> + struct buffer_head *priorities;
> +
> + long *freelist;
> + size_t free_size;
> + size_t free_front;
> + size_t free_back;
> +
> + struct block_device *devices[256];
> + struct buffer_head *uuids;
> +
> + long current_bucket;
> + int sectors_free;
> +};
> +
> +struct journal_header {
> + uint32_t csum;
> + uint32_t seq;
> + uint32_t last_open_entry;
> + uint32_t nr_entries;
> +};
> +
> +struct search_context {
> + struct work_struct w;
> + struct btree_key k;
> + struct bio *bio;
> + void *q;
> + int error;
> + atomic_t remaining;
> + struct search_context *parent;
> + void (*end_fn)(void *, struct bio *, struct search_context *);
> + bio_end_io_t *end_io;
> + sector_t bi_sector;
> +};
> +
> +static const char bcache_magic[] = { 0xc6, 0x85, 0x73, 0xf6, 0x4e, 0x1a, 0x45, 0xca, 0x82, 0x65, 0xf5, 0x7f, 0x48, 0xba, 0x6d, 0x81 };
> +
> +static struct kobject *bcache_kobj;
> +static struct block_device *devices[256];
> +static char uuids[PAGE_SIZE];
> +
> +static LIST_HEAD(cache_devices);
> +
> +static int request_hook(struct request_queue *q, struct bio *bio);
> +static void btree_write_node_bh(struct bio *bio, int error);
> +static void bio_insert(void *private, struct bio *bio, struct search_context *s);
> +static void request_hook_read(void *p, struct bio *bio, struct search_context *t);
> +static void submit_wait_bio(int rw, struct bio *bio, struct cache_device *c, struct search_context *s);
> +
> +#define pages_per_bucket (c->sb.bucket_size >> (PAGE_SHIFT - 9))
> +#define bucket_to_sector(b) ((uint64_t) ((b) + c->sb.first_bucket) * c->sb.bucket_size)
> +#define sector_to_bucket(s) ((long) ((s) / c->sb.bucket_size) - c->sb.first_bucket)
> +#define sector_to_gen(s) c->buckets[sector_to_bucket(s)].generation
> +
> +/*
> + * 32 keys in a page
> + * key: 8 bit device, 56 bit offset
> + * value: 8 bit generation, 16 bit length, 40 bit offset
> + * All units are in sectors
> + * XXX: Need to take into account PAGE_SIZE...
> + */
> +static inline uint64_t *tree_key(void *p[], int i)
> +{ return p[i >> 5] + ((i & ~(~0 << 5)) << 7); }
> +
> +static inline uint64_t *tree_ptr(void *p[], int i)
> +{ return p[i >> 5] + ((i & ~(~0 << 5)) << 7) + sizeof(uint64_t); }
> +
> +#define TREE_KEY(device, offset) (((uint64_t) device) << 56 | (offset))
> +#define TREE_KEY_OFFSET(page, i) (*tree_key(page, i) & ~((uint64_t) ~0 << 56))
> +
> +#define TREE_PTR(gen, length, offset) ((gen) | (length) << 8 | (offset) << 24)
> +#define TREE_PTR_GEN(page, i) (*tree_ptr(page, i) & ~(~0 << 8))
> +#define TREE_PTR_LENGTH(page, i) ((*tree_ptr(page, i) >> 8) & ~(~0 << 16))
> +#define TREE_PTR_OFFSET(page, i) ((*tree_ptr(page, i) >> 24))
> +/*#define TREE_PTR_BUCKET(page, i) ((*tree_ptr(page, i) >> 24)\
> + / c->sb.bucket_size - c->sb.first_bucket) */
> +
> +static int lookup_dev(struct cache_device *c, struct bio *bio)
> +{
> + int dev;
> + for (dev = 0; dev < 256 && c->devices[dev] != bio->bi_bdev; dev++)
> + ;
> +
> + if (dev == 256)
> + printk(KERN_DEBUG "bcache: unknown device\n");
> +
> + return dev;
> +}
> +
> +// called by heap location
> +static int heap_swap(struct cache_device *c, long i, long j)
> +{
> + if (c->buckets[c->heap[i]].priority >= c->buckets[c->heap[j]].priority)
> + return 0;
> +
> + c->heap[i] = c->buckets[j].heap;
> + c->heap[j] = c->buckets[i].heap;
> +
> + c->buckets[c->heap[i]].heap = i;
> + c->buckets[c->heap[j]].heap = j;
> + return 1;
> +}
> +
> +static void heap_sort(struct cache_device *c, long h)
> +{
> + while (h > 0) {
> + uint32_t p = ((h + 1) >> 1) - 1;
> + if (!heap_swap(c, h, p))
> + break;
> +
> + h = p;
> + }
> +
> + while (1) {
> + uint32_t r = ((h + 1) << 1) + 1;
> + uint32_t l = ((h + 1) << 1);
> +
> + if (r < c->heap_size &&
> + c->buckets[c->heap[r]].priority < c->buckets[c->heap[l]].priority &&
> + heap_swap(c, h, r))
> + h = r;
> + else if (l < c->heap_size && heap_swap(c, h, l))
> + h = l;
> + else
> + break;
> + }
> +}
> +
> +static void heap_insert(struct cache_device *c, long b)
> +{
> + c->buckets[b].heap = c->heap_size;
> + c->heap[c->heap_size++] = b;
> + heap_sort(c, c->buckets[b].heap);
> +}
> +
> +/*static void heap_remove(struct cache_device *c, long b)
> +{
> + long top = c->heap[--c->heap_size];
> +
> + c->buckets[top].heap = c->buckets[b].heap;
> + c->heap[c->buckets[b].heap] = b;
> + heap_sort(c, c->buckets[b].heap);
> +}*/
> +
> +static long heap_pop(struct cache_device *c)
> +{
> + long ret, top;
> + ret = c->heap[0];
> + top = c->heap[--c->heap_size];
> +
> + c->buckets[top].heap = 0;
> + c->heap[0] = top;
> + heap_sort(c, c->buckets[0].heap);
> +
> + c->buckets[ret].priority = 0;
> + return ret;
> +}
> +
> +static long __pop_bucket(struct cache_device *c)
> +{
> + long r;
> + int free = c->free_front - c->free_back;
> +
> + if (free < 0)
> + free += c->free_size;
> +
> + for (; free < c->free_size >> 1; free++) {
> + if (c->sb.first_free_bucket < c->sb.nbuckets)
> + r = c->sb.first_free_bucket++;
> + else
> + r = heap_pop(c);
> +
> + c->buckets[r].generation++;
> + c->freelist[c->free_front++] = r;
> + c->free_front &= c->free_size;
> +
> + blkdev_issue_discard(c->bdev, bucket_to_sector(r),
> + c->sb.bucket_size, GFP_NOIO, 0);
> + }
> +
> + r = c->freelist[c->free_back++];
> + c->free_back &= c->free_size;
> +
> + return r;
> +}
> +
> +static void pop_bucket(struct cache_device *c)
> +{
> + c->sectors_free = c->sb.bucket_size;
> + c->current_bucket = bucket_to_sector(__pop_bucket(c));
> +}
> +
> +static uint64_t alloc_bucket(struct cache_device *c, struct page *p[], void *data[])
> +{
> + int i;
> + long b = __pop_bucket(c);
> +
> + for (i = 0; i < pages_per_bucket; i++) {
> + p[i] = find_or_create_page(c->bdev->bd_inode->i_mapping,
> + bucket_to_sector(b) + (i << 3), GFP_NOIO);
> + data[i] = kmap(p[i]);
> + }
> + // page is locked...
> +
> + return TREE_PTR(c->buckets[b].generation, 0, bucket_to_sector(b));
> +}
> +
> +static void free_bucket(struct cache_device *c, uint64_t offset, struct page *p[])
> +{
> + long b = sector_to_bucket(offset);
> + struct address_space *mapping = p[0]->mapping;
> +
> + BUG_ON(!c);
> +
> + c->buckets[b].generation++;
> +
> + spin_lock_irq(&mapping->tree_lock);
> + /*for (i = 0; i < pages; i++)
> + __remove_from_page_cache(p[i]);*/
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + blkdev_issue_discard(c->bdev, bucket_to_sector(b),
> + c->sb.bucket_size, GFP_NOIO, 0);
> + c->freelist[c->free_front++] = b;
> + c->free_front &= c->free_size;
> +}
> +
> +static int get_bucket(struct cache_device *c, uint64_t offset, struct page *p[], void *data[], struct search_context **s)
> +{
> + int i, nvecs, ret = 0;
> + struct bio *bio = NULL;
> +
> + memset(&p[0], 0, pages_per_bucket * sizeof(void*));
> +
> + if (sector_to_bucket(offset) >= c->sb.nbuckets) {
> + printk(KERN_DEBUG "bcache: bad bucket number\n");
> + return 0;
> + }
> +
> + offset >>= PAGE_SHIFT - 9;
> +
> + nvecs = find_get_pages(c->bdev->bd_inode->i_mapping, offset, pages_per_bucket, p);
> +
> + if (nvecs != pages_per_bucket && *s == NULL) {
> + printk(KERN_DEBUG "bcache: Making a search context\n");
> + *s = kzalloc(sizeof(struct search_context), GFP_NOIO);
> + atomic_set(&(*s)->remaining, 0);
> + }
> +
> + for (i = 0; i < pages_per_bucket; i++)
> + if (!p[i]) {
> + p[i] = __page_cache_alloc(GFP_NOIO);
> + p[i]->mapping = c->bdev->bd_inode->i_mapping;
> + if (add_to_page_cache_lru(p[i],
> + c->bdev->bd_inode->i_mapping,
> + offset + i,
> + GFP_NOIO & GFP_RECLAIM_MASK)) {
> + __free_pages(p[i], 0);
> + goto wait;
> + }
> +
> + if (!bio) {
> + bio = bio_kmalloc(GFP_NOIO, pages_per_bucket - nvecs);
> + bio->bi_sector = (offset + i) << (PAGE_SHIFT - 9);
> + }
> + ++nvecs;
> +
> + bio->bi_io_vec[bio->bi_vcnt].bv_len = PAGE_SIZE;
> + bio->bi_io_vec[bio->bi_vcnt].bv_offset = 0;
> + bio->bi_io_vec[bio->bi_vcnt].bv_page = p[i];
> + bio->bi_vcnt++;
> + bio->bi_size += PAGE_SIZE;
> + } else {
> +wait: wait_on_page_locked(p[i]);
> +
> + if (bio)
> + submit_wait_bio(READ, bio, c, *s);
> + bio = NULL;
> + if (i == ret)
> + ret++;
> +
> + data[i] = kmap(p[i]);
> + }
> +
> + if (bio)
> + submit_wait_bio(READ, bio, c, *s);
> +
> + //printk(KERN_DEBUG "bcache: get_bucket() return %i\n", ret);
> + return ret;
> +}
> +
> +static void put_bucket(struct cache_device *c, long offset, struct page *p[])
> +{
> + int i;
> + for (i = 0; i < pages_per_bucket; i++)
> + if (p[i]) {
> + kunmap(p[i]);
> + put_page(p[i]);
> + }
> +}
> +
> +static void bio_run_work(struct work_struct *w)
> +{
> + struct search_context *s = container_of(w, struct search_context, w);
> + s->end_fn(s->q, s->bio, s);
> + if (atomic_read(&s->remaining) == 0) {
> + if (s->parent)
> + if (atomic_dec_and_test(&s->parent->remaining))
> + bio_run_work(&s->parent->w);
> +
> + kfree(s);
> + }
> +}
> +
> +static void bio_add_work(struct bio *bio, int error)
> +{
> + int i;
> + struct search_context *s = bio->bi_private;
> +
> + if (s->end_fn == request_hook_read)
> + for (i = 0; i < bio->bi_vcnt; i++)
> + unlock_page(bio->bi_io_vec[i].bv_page);
> +
> + bio_put(bio);
> +
> + if (atomic_dec_and_test(&s->remaining)) {
> + if (!s->end_fn && s->bio) {
> + s->bio->bi_end_io(s->bio, 0);
> + bio_put(s->bio);
> + kfree(s);
> + } else if (s->q) {
> + s->error = error;
> + INIT_WORK(&s->w, bio_run_work);
> + schedule_work(&s->w);
> + } else {
> + if (s->parent)
> + if (atomic_dec_and_test(&s->parent->remaining)) {
> + INIT_WORK(&s->parent->w, bio_run_work);
> + schedule_work(&s->parent->w);
> + }
> + kfree(s);
> + }
> + }
> +}
> +
> +static void submit_wait_bio(int rw, struct bio *bio, struct cache_device *c, struct search_context *s)
> +{
> + BUG_ON(!bio->bi_vcnt);
> + bio->bi_bdev = c->bdev;
> + bio->bi_private = s;
> + if (!bio->bi_end_io)
> + bio->bi_end_io = bio_add_work;
> +
> + atomic_inc(&s->remaining);
> + submit_bio(rw, bio);
> +}
> +
> +static int btree_bsearch(void *data[], int nkeys, uint64_t search)
> +{
> + int l = 1, r = nkeys + 1;
> +
> + while (l < r) {
> + int m = (l + r) >> 1;
> + if (*tree_key(data, m) < search)
> + l = m + 1;
> + else
> + r = m;
> + }
> +
> + return l;
> +}
> +
> +static int node_compare(const void *l, const void *r)
> +{
> + const struct btree_key *a = l, *b = r;
> + return a->key - b->key;
> +}
> +
> +static bool ptr_checks(struct cache_device *c, uint64_t p)
> +{
> + if (sector_to_bucket(p >> 24) < 0 ||
> + sector_to_bucket(p >> 24) > c->sb.nbuckets ||
> + ((p >> 8) & ~(~0 << 16)) > c->sb.bucket_size)
> + return true;
> + return false;
> +}
> +
> +static int btree_clean(struct cache_device *c, struct page *p[])
> +{
> + int l;
> + void *v;
> + struct btree_node_header *h, *i, *j;
> + struct btree_key *k;
> +
> + k = v = vmap(p, pages_per_bucket, VM_MAP, PAGE_KERNEL);
> + if (!v) {
> + printk(KERN_DEBUG "bcache: vmap() error\n");
> + return 1;
> + }
> +
> + while (1) {
> + for (h = i = j = v;
> + (void*) j < v + PAGE_SIZE * pages_per_bucket;
> + j = (void*) h + PAGE_SIZE * ((h->nkeys >> 5) + 1)) {
> + if (i->random != j->random)
> + break;
> + h = i;
> + i = j;
> + }
> + if (h == i)
> + break;
> +
> + memmove(h + h->nkeys, i + 1, i->nkeys * sizeof(struct btree_node_header));
> + h->nkeys += i->nkeys;
> + }
> +
> + for (l = 1; l <= h->nkeys; l++) {
> + if (ptr_checks(c, k[l].ptr)) {
> + printk(KERN_DEBUG "bcache: btree_clean removed bad ptr\n");
> + k[l].key = ~0;
> + continue;
> + }
> +
> + if ((k[l].ptr & ~(~0 << 8)) != sector_to_gen(k[l].ptr >> 24))
> + k[l].key = ~0;
> + }
> +
> + sort(&k[1], h->nkeys, sizeof(struct btree_key), node_compare, NULL);
> +
> + for (; k[h->nkeys].key == ~0; h->nkeys--)
> + ;
> +
> + vunmap(v);
> + return 0;
> +}
> +
> +// Iterate over the sorted sets of pages
> +#define for_each_sorted_set(i, data, h, random) \
> + for (h = data[0], i = data; \
> + i < data + pages_per_bucket && \
> + h->random == ((struct btree_node_header*) data[0])->random;\
> + i += (h->nkeys >> 5) + 1, h = *i)
> +
> +#define sorted_set_checks() \
> + do { \
> + if (h->nkeys + 1 > (pages_per_bucket - (i - data)) * 32) { \
> + printk(KERN_DEBUG \
> + "bcache: Bad btree header: page %li h->nkeys %i\n",\
> + i - data, h->nkeys); \
> + if (i == data) \
> + h->nkeys = 0; \
> + else \
> + h->random = 0; \
> + break; \
> + } \
> + if (h->nkeys + 1 > (pagesread - (i - data)) * 32) { \
> + ret = -1; \
> + goto out; \
> + } \
> + } while (0)
> +
> +static int btree_search(struct cache_device *c, long root, int level, int device, struct bio *bio, struct search_context **s)
> +{
> + int r, ret = 0, j, pagesread;
> + uint64_t search;
> + struct page *p[pages_per_bucket];
> + void *data[pages_per_bucket], **i;
> + struct btree_node_header *h;
> +
> + if ((pagesread = get_bucket(c, root, p, data, s)) <= 0)
> + return -1 - pagesread;
> +
> + search = TREE_KEY(device, bio->bi_sector);
> +
> + for_each_sorted_set(i, data, h, random) {
> + sorted_set_checks();
> +
> + for (j = btree_bsearch(i, h->nkeys, search);
> + search < *tree_key(i, j) + (c->sb.bucket_size << 8) &&
> + j <= h->nkeys;
> + j++)
> + if (level) {
> + r = btree_search(c, TREE_PTR_OFFSET(i, j), level - 1, device, bio, s);
> + ret = r == 1 ? 1 : min(r, ret);
> + } else {
> + printk(KERN_DEBUG "bcache: btree_search() j %i key %llu ptr %llu", j,
> + *tree_key(i, j), *tree_ptr(i, j));
> +
> + if (ptr_checks(c, *tree_ptr(i, j))) {
> + printk(KERN_DEBUG "bad ptr\n");
> + continue;
> + }
> + if (TREE_PTR_GEN(i, j) != sector_to_gen(TREE_PTR_OFFSET(i, j))) {
> + printk(KERN_DEBUG "bad gen\n");
> + continue;
> + }
> + if (search > *tree_key(i, j) + TREE_PTR_LENGTH(i, j)) {
> + printk(KERN_DEBUG "early block \n");
> + continue;
> + }
> + if (search + bio_sectors(bio) < *tree_key(i, j)) {
> + printk(KERN_DEBUG "late block\n");
> + continue;
> + }
> +
> + if (bio->bi_sector >= TREE_KEY_OFFSET(i, j) &&
> + bio->bi_sector + bio_sectors(bio) <=
> + TREE_KEY_OFFSET(i, j) + TREE_PTR_LENGTH(i, j)) {
> + // all the data we need is here
> + bio->bi_sector = TREE_PTR_OFFSET(i, j) + (bio->bi_sector - TREE_KEY_OFFSET(i, j));
> + bio->bi_bdev = c->bdev;
> + ret = 1;
> + goto out;
> + } else {
> + // got some, need more...
> + }
> + }
> + }
> +out:
> + put_bucket(c, root, p);
> + return ret;
> +}
> +
> +static void btree_write_node(struct cache_device *c, struct page *p[], int nkeys, int pages)
> +{
> + int i, n = (nkeys >> 5) + 1;
> + struct bio *bio;
> +
> + bio = bio_kmalloc(GFP_NOIO, n);
> +
> + bio->bi_sector = page_index(p[0]) >> 3;
> + bio->bi_bdev = c->bdev;
> + bio->bi_size = n * PAGE_SIZE;
> + bio->bi_end_io = btree_write_node_bh;
> +
> + bio->bi_vcnt = n;
> + for (i = 0; i < n; i++) {
> + bio->bi_io_vec[i].bv_page = p[i];
> + bio->bi_io_vec[i].bv_len = PAGE_SIZE;
> + bio->bi_io_vec[i].bv_offset = 0;
> +
> + ClearPageDirty(p[i]);
> + get_page(p[i]);
> + unlock_page(p[i]);
> + }
> +
> + for (; i < pages; i++)
> + unlock_page(p[i]);
> +
> + submit_bio(WRITE, bio);
> +}
> +
> +static void btree_write_node_bh(struct bio *bio, int error)
> +{
> + int i;
> + for (i = 0; i < bio->bi_vcnt; i++)
> + put_page(bio->bi_io_vec[i].bv_page);
> +
> + bio_put(bio);
> +}
> +
> +static void btree_insert_one_key(void *i[], struct btree_key *k)
> +{
> + int j, m;
> + struct btree_node_header *h = i[0];
> +
> + m = btree_bsearch(i, h->nkeys, k->key);
> +
> + printk(KERN_DEBUG "btree_insert() at %i h->nkeys %i key %llu ptr %llu\n", m, h->nkeys, k->key, k->ptr);
> +
> + for (j = h->nkeys++; j >= m; --j)
> + memcpy(tree_key(i, j + 1), tree_key(i, j), sizeof(struct btree_key));
> +
> + memcpy(tree_key(i, m), k, sizeof(struct btree_key));
> +}
> +
> +static int btree_split(struct cache_device *c, long root, int level, struct btree_key *k, struct btree_key *new_keys,
> + struct page *p[], void *data[], int nkeys)
> +{
> + int j, ret;
> + struct page *p1[pages_per_bucket], *p2[pages_per_bucket];
> + void *d1[pages_per_bucket], *d2[pages_per_bucket];
> + struct btree_node_header *h, *h1, *h2;
> + struct btree_key t[2];
> +
> + for (j = 0; j < pages_per_bucket; j++)
> + if (!trylock_page(p[j])) {
> + wait_on_page_locked(p[j]);
> +
> + for (--j; j >= 0; --j)
> + unlock_page(p[j]);
> +
> + return -1;
> + }
> +
> + btree_clean(c, p);
> + h = data[0];
> +
> + t[1].key = *tree_key(data, h->nkeys >> 1);
> + t[0].key = *tree_key(data, h->nkeys);
> + t[1].ptr = alloc_bucket(c, p1, d1);
> + t[0].ptr = alloc_bucket(c, p2, d2);
> + h1 = *d1;
> + h2 = *d2;
> +
> + get_random_bytes(&h1->random, sizeof(uint64_t));
> + get_random_bytes(&h2->random, sizeof(uint64_t));
> + h1->nkeys = h->nkeys >> 1;
> + h2->nkeys = h->nkeys - h1->nkeys;
> +
> + for (j = 1; j <= h1->nkeys; j++)
> + memcpy(tree_key(d1, j), tree_key(data, j), sizeof(struct btree_key));
> + for (j = 1; j <= h2->nkeys; j++)
> + memcpy(tree_key(d2, j), tree_key(data, j + h1->nkeys), sizeof(struct btree_key));
> +
> + for (; nkeys > 0; --nkeys, ++k)
> + if (k->key < *tree_key(d1, h1->nkeys))
> + btree_insert_one_key(d1, k);
> + else
> + btree_insert_one_key(d2, k);
> +
> + btree_write_node(c, p1, h1->nkeys, pages_per_bucket);
> + btree_write_node(c, p2, h2->nkeys, pages_per_bucket);
> + put_bucket(c, t[1].ptr, p1);
> + put_bucket(c, t[0].ptr, p2);
> + free_bucket(c, root, p);
> +
> + // move this into free_bucket?
> + for (j = 0; j < pages_per_bucket; j++)
> + unlock_page(p[j]);
> +
> + if (c->sb.btree_level == level) {
> + // tree depth increases
> + c->sb.btree_root = alloc_bucket(c, p, data);
> + c->sb.btree_level++;
> + h = data[0];
> + get_random_bytes(&h->random, sizeof(uint64_t));
> + h->nkeys = 2;
> + memcpy(tree_key(data, 1), &t[0], sizeof(struct btree_key)); //eh? wrong
> + memcpy(tree_key(data, 2), &t[1], sizeof(struct btree_key));
> + btree_write_node(c, p, h->nkeys, pages_per_bucket);
> + ret = 0;
> + } else
> + ret = 2;
> +
> + memcpy(&new_keys[0], &t[0], sizeof(struct btree_key) * 2);
> + return ret;
> +}
> +
> +static int btree_insert(struct cache_device *c, long root, int level, struct btree_key *k, struct btree_key *new_keys, struct search_context **s)
> +{
> + int j, nkeys = 1, ret = 0, pagesread;
> + uint64_t biggest_key = 0;
> + struct page *p[pages_per_bucket];
> + void *data[pages_per_bucket], **i;
> + struct btree_node_header *h;
> + struct btree_key recurse_key = { .key = ~0, .ptr = 0};
> +
> + if ((pagesread = get_bucket(c, root, p, data, s)) <= 0)
> + return -1 - pagesread;
> +
> + if (level) {
> + for_each_sorted_set(i, data, h, random) {
> + sorted_set_checks();
> +
> + j = btree_bsearch(i, h->nkeys, k->key);
> +
> + while (TREE_PTR_GEN(i, j) != sector_to_gen(TREE_PTR_OFFSET(i, j)))
> + if (++j > h->nkeys)
> + continue;
> +
> + if (*tree_key(i, j) < recurse_key.key)
> + memcpy(&recurse_key, tree_key(i, j), sizeof(struct btree_key));
> + }
> +
> + BUG_ON(recurse_key.key == ~0);
> +
> + if ((nkeys = btree_insert(c, recurse_key.ptr >> 24, level - 1, k, new_keys, s)) == -1)
> + goto out;
> + k = new_keys;
> + }
> +
> +retry:
> + biggest_key = 0;
> + for (; nkeys > 0; --nkeys, ++k) {
> + for_each_sorted_set(i, data, h, random) {
> + sorted_set_checks();
> +
> + biggest_key = max(biggest_key, *tree_key(i, h->nkeys));
> +
> + if (PageDirty(p[i - data]) && h->nkeys < 32)
> + goto insert;
> + }
> + if (pagesread != pages_per_bucket) {
> + ret = -1;
> + goto out;
> + }
> + if (i == data + pages_per_bucket) {
> + printk(KERN_DEBUG "bcache: btree_insert() splitting\n");
> + if ((ret = btree_split(c, root, level, k, new_keys, p, data, nkeys)) == -1) {
> + ret = 0;
> + goto retry;
> + }
> + goto out;
> + }
> +insert:
> + if (!trylock_page(p[i - data])) {
> + wait_on_page_locked(p[i - data]);
> + goto retry;
> + }
> + SetPageDirty(p[i - data]);
> +
> + if (h->random != ((struct btree_node_header*) data[0])->random) {
> + h->random = ((struct btree_node_header*) data[0])->random;
> + h->nkeys = 0;
> + }
> +
> + for (; nkeys && h->nkeys < 31; --nkeys, ++k) {
> + btree_insert_one_key(i, k);
> +
> + if (k->key > biggest_key && c->sb.btree_level != level) {
> + new_keys[0].key = k->key;
> + new_keys[0].ptr = TREE_PTR(++sector_to_gen(root), 0, root);
> + ret = 1;
> + }
> +
> + biggest_key = max(k->key, biggest_key);
> + }
> +
> + if (h->nkeys == 31)
> + btree_write_node(c, &p[i - data], h->nkeys, 0);
> + else
> + unlock_page(p[i - data]);
> + }
> +out:
> + put_bucket(c, root, p);
> + return ret;
> +}
> +
> +static void bio_insert_finish(void *q, struct bio *bio, struct search_context *s)
> +{
> + struct cache_device *c = q;
> + struct btree_key new_keys[2];
> +
> + btree_insert(c, c->sb.btree_root, c->sb.btree_level, &s->k, new_keys, &s);
> +}
> +
> +static void bio_insert(void *private, struct bio *bio, struct search_context *s)
> +{
> + int dev, written = 0;
> + struct cache_device *c;
> + struct btree_key k, new_keys[2];
> + struct bio *n;
> + struct search_context *t = NULL;
> +
> + s->end_fn = NULL;
> + bio->bi_end_io = s->end_io;
> + bio->bi_private = private;
> + bio->bi_sector = s->bi_sector;
> +
> + if (s->error || list_empty(&cache_devices))
> + goto err;
> +
> + list_rotate_left(&cache_devices);
> + c = list_first_entry(&cache_devices, struct cache_device, list);
> +
> + if ((dev = lookup_dev(c, bio)) == 256)
> + goto err;
> +
> + for (bio->bi_idx = bio->bi_size = 0; bio->bi_idx < bio->bi_vcnt; bio->bi_idx++)
> + bio->bi_size += bio->bi_io_vec[bio->bi_idx].bv_len;
> +
> + for (bio->bi_idx = 0; bio->bi_idx < bio->bi_vcnt; ) {
> + if (c->sectors_free < min_t(unsigned, bio_sectors(bio), PAGE_SIZE >> 9))
> + pop_bucket(c);
> +
> + if (!(n = bio_kmalloc(GFP_NOIO, 0)))
> + goto err;
> +
> + n->bi_sector = c->current_bucket + c->sb.bucket_size - c->sectors_free;
> + n->bi_size = bio->bi_size;
> + n->bi_vcnt = bio->bi_vcnt - bio->bi_idx;
> + n->bi_io_vec = bio->bi_io_vec + bio->bi_idx;
> +
> + while (bio_sectors(n) > c->sectors_free)
> + n->bi_size -= n->bi_io_vec[--n->bi_vcnt].bv_len;
> +
> + n->bi_max_vecs = n->bi_vcnt;
> + bio->bi_idx += n->bi_vcnt;
> + bio->bi_size -= n->bi_size;
> +
> + k.key = TREE_KEY(dev, bio->bi_sector + written);
> + k.ptr = TREE_PTR(sector_to_gen(c->current_bucket), bio_sectors(n), n->bi_sector);
> +
> + written += bio_sectors(n);
> + c->sectors_free -= bio_sectors(n);
> +
> + submit_wait_bio(WRITE, n, c, s);
> +
> + if (btree_insert(c, c->sb.btree_root, c->sb.btree_level, &k, new_keys, &t) == -1) {
> + t->q = c;
> + t->end_fn = bio_insert_finish;
> + memcpy(&s->k, &k, sizeof(struct btree_key));
> + }
> + t = NULL;
> + }
> +
> + bio->bi_size = written << 9;
> + bio->bi_idx = 0;
> + return;
> +err:
> + if (!atomic_read(&s->remaining)) {
> + bio->bi_end_io(bio, s->error);
> + bio_put(bio);
> + }
> +}
> +
> +static void request_hook_read(void *p, struct bio *bio, struct search_context *s)
> +{
> + struct list_head *l;
> + struct request_queue *q = p;
> +
> + if (list_empty(&cache_devices))
> + goto out;
> +
> + list_for_each(l, &cache_devices) {
> + int dev;
> + struct cache_device *c = list_entry(l, struct cache_device, list);
> +
> + if ((dev = lookup_dev(c, bio)) == 256)
> + continue;
> +
> + if (btree_search(c, c->sb.btree_root, c->sb.btree_level, dev, bio, &s) == 1) {
> + printk(KERN_DEBUG "bcache: cache hit\n");
> + generic_make_request(bio);
> + return;
> + }
> + }
> +
> + if (s && atomic_read(&s->remaining)) {
> + s->bio = bio;
> + s->q = q;
> + s->end_fn = request_hook_read;
> + return;
> + }
> +
> + if (!s)
> + s = kzalloc(sizeof(struct search_context), GFP_NOIO);
> +
> + printk(KERN_DEBUG "bcache: cache miss, starting write\n");
> + atomic_set(&s->remaining, 1);
> + s->bio = bio;
> + s->q = bio->bi_private;
> + s->end_fn = bio_insert;
> + s->end_io = bio->bi_end_io;
> + s->bi_sector = bio->bi_sector;
> +
> + bio->bi_private = s;
> + bio->bi_end_io = bio_add_work;
> + bio_get(bio);
> + bio_get(bio);
> +
> +out:
> + if (q->make_request_fn(q, bio))
> + generic_make_request(bio);
> +}
> +
> +static void request_hook_write(struct request_queue *q, struct bio *bio, struct search_context *s)
> +{
> + if (q->make_request_fn(q, bio))
> + generic_make_request(bio);
> +}
> +
> +static int request_hook(struct request_queue *q, struct bio *bio)
> +{
> + if (bio->bi_size) {
> + if (bio_rw_flagged(bio, BIO_RW))
> + request_hook_write(q, bio, NULL);
> + else
> + request_hook_read(q, bio, NULL);
> + return 0;
> + } else
> + return 1;
> +}
> +
> +#define write_attribute(n) static struct attribute sysfs_##n = { .name = #n, .mode = S_IWUSR }
> +#define read_attribute(n) static struct attribute sysfs_##n = { .name = #n, .mode = S_IRUSR }
> +
> +write_attribute(register_cache);
> +write_attribute(register_dev);
> +write_attribute(unregister);
> +read_attribute(bucket_size);
> +read_attribute(buckets_used);
> +read_attribute(buckets_free);
> +read_attribute(nbuckets);
> +
> +static void load_priorities(struct cache_device *c)
> +{
> + uint32_t i = 0, per_page = PAGE_SIZE / sizeof(struct bucket_disk);
> + struct bucket_disk *b;
> + struct buffer_head *bh;
> + goto start;
> +
> + for (; i < c->sb.first_free_bucket; i++, b++) {
> + if ((char*) (b + 1) > bh->b_data + PAGE_SIZE) {
> + put_bh(bh);
> +start: bh = __bread(c->bdev, i / per_page + 3, PAGE_SIZE);
> + b = (void*) bh->b_data;
> + }
> +
> + c->buckets[i].priority = le16_to_cpu(b->priority);
> + c->buckets[i].generation = b->generation;
> +
> + if (c->buckets[i].priority == 0 &&
> + c->free_front != c->free_back) {
> + c->freelist[c->free_front++] = i;
> + c->free_front &= c->free_size;
> + } else if (c->buckets[i].priority != ~0)
> + heap_insert(c, i);
> + }
> + put_bh(bh);
> +}
> +
> +static void save_priorities(struct cache_device *c)
> +{
> + uint32_t i = 0, per_page = PAGE_SIZE / sizeof(struct bucket_disk);
> + struct bucket_disk *b;
> + struct buffer_head *bhv[(c->sb.nbuckets - 1) / per_page + 1], *bh = bhv[0];
> + goto start;
> +
> + for (; i < c->sb.nbuckets; i++, b++) {
> + if ((char*) (b + 1) > bh->b_data + PAGE_SIZE) {
> + submit_bh(WRITE, bh++);
> +start: bh = __getblk(c->bdev, (i / per_page + 3), PAGE_SIZE);
> + b = (void*) bh->b_data;
> + }
> +
> + b->priority = cpu_to_le16(c->buckets[i].priority);
> + b->generation = c->buckets[i].generation;
> + }
> + submit_bh(WRITE, bh);
> +
> + for (i = 0; i < (c->sb.nbuckets - 1) / per_page + 1; i++) {
> + wait_on_buffer(bhv[i]);
> + put_bh(bhv[i]);
> + }
> +}
> +
> +static void register_dev_on_cache(struct cache_device *c, int d)
> +{
> + int i, j;
> +
> + c->uuids = __bread(c->bdev, 2, PAGE_SIZE);
> +
> + if (!devices[d]) {
> + printk(KERN_DEBUG "bcache: Tried to register nonexistant device/queue\n");
> + return;
> + }
> +
> + for (i = 0; i < 256; i++) {
> + for (j = 0; j < 16; j++)
> + if (c->uuids->b_data[i*16 + j])
> + break;
> +
> + if (j == 16) {
> + printk(KERN_DEBUG "Inserted new uuid\n");
> + memcpy(c->uuids->b_data + i*16, &uuids[d*16], 16);
> + set_buffer_dirty(c->uuids);
> + break;
> + }
> +
> + if (!memcmp(c->uuids->b_data + i*16, &uuids[d*16], 16)) {
> + printk(KERN_DEBUG "Looked up uuid\n");
> + break;
> + }
> + }
> + put_bh(c->uuids);
> +
> + if (i == 256) {
> + printk(KERN_DEBUG "Aiee! No room for the uuid\n");
> + return;
> + }
> +
> + c->devices[i] = devices[d];
> +}
> +
> +static int parse_uuid(const char *s, char *uuid)
> +{
> + int i, j, x;
> + memset(uuid, 0, 16);
> +
> + //for (i = 0, j = 0; i < strspn(s, "-0123456789:ABCDEFabcdef") && j < 32; i++) {
> + for (i = 0, j = 0; s[i] && j < 32; i++) {
> + x = s[i] | 32;
> +
> + if (x == ':' || x == '-')
> + continue;
> +
> + if (x > 'f' || x < '0')
> + return i;
> +
> + if (x <= '9')
> + x -= '0';
> + else if (x >= 'a')
> + x -= 'a' - 10;
> + else
> + return i;
> +
> + x <<= ((j & 1) << 2);
> + uuid[j++ >> 1] |= x;
> + }
> + return i;
> +}
> +
> +static void register_dev(const char *buffer, size_t size)
> +{
> + int i, j;
> + char *path;
> + unsigned char uuid[16];
> + struct block_device *bdev;
> + struct list_head *l;
> +
> + i = parse_uuid(buffer, &uuid[0]);
> +
> + if (i < 4) {
> + printk(KERN_DEBUG "bcache: Bad uuid\n");
> + return;
> + }
> +
> + path = kmalloc(size + 1 - i, GFP_KERNEL);
> + if (!path) {
> + printk(KERN_DEBUG "bcache: kmalloc error\n");
> + return;
> + }
> + strcpy(path, skip_spaces(buffer + i));
> + bdev = lookup_bdev(strim(path));
> +
> + if (IS_ERR(bdev)) {
> + printk(KERN_DEBUG "bcache: Failed to open %s\n", path);
> + kfree(path);
> + return;
> + }
> +
> + for (i = 0; i < 256; i++) {
> + for (j = 0; j < 16; j++)
> + if (uuids[i*16 + j])
> + break;
> +
> + if (j == 16)
> + break;
> +
> + if (!memcmp(&uuids[i*16], uuid, 16)) {
> + printk(KERN_DEBUG "bcache: %s already registered\n", path);
> + kfree(path);
> + return;
> + }
> + }
> + memcpy(&uuids[i*16], uuid, 16);
> + devices[i] = bdev;
> +
> + list_for_each(l, &cache_devices)
> + register_dev_on_cache(list_entry(l, struct cache_device, list), i);
> +
> + bdev->bd_cache_fn = request_hook;
> + printk(KERN_DEBUG "bcache: Caching %s\n", path);
> + kfree(path);
> +}
> +
> +static ssize_t store_cache(struct kobject *kobj, struct attribute *attr, const char *buffer, size_t size)
> +{
> + if (attr == &sysfs_unregister) {
> + // kobject_put(kobj);
> + }
> + return size;
> +}
> +
> +static ssize_t show_cache(struct kobject *kobj, struct attribute *attr, char *buffer)
> +{
> + struct cache_device *c = container_of(kobj, struct cache_device, kobj);
> + if (attr == &sysfs_bucket_size)
> + return snprintf(buffer, PAGE_SIZE, "%i\n", c->sb.bucket_size * 512);
> + if (attr == &sysfs_buckets_used)
> + return snprintf(buffer, PAGE_SIZE, "%lli\n", c->sb.first_free_bucket - c->sb.first_bucket);
> + if (attr == &sysfs_buckets_free)
> + return snprintf(buffer, PAGE_SIZE, "%lli\n", c->sb.nbuckets - c->sb.first_free_bucket + c->sb.first_bucket);
> + if (attr == &sysfs_nbuckets)
> + return snprintf(buffer, PAGE_SIZE, "%lli\n", c->sb.nbuckets);
> + return 0;
> +}
> +
> +static void unregister_cache(struct kobject *k)
> +{
> + struct cache_sb *s;
> + struct cache_device *c = container_of(k, struct cache_device, kobj);
> + struct buffer_head *bh = __getblk(c->bdev, 1, PAGE_SIZE);
> +
> + list_del(&c->list);
> +
> + save_priorities(c);
> + put_bh(c->uuids);
> + vfree(c->buckets);
> + vfree(c->heap);
> +
> + s = (struct cache_sb*) bh->b_data;
> + s->version = cpu_to_le32(c->sb.version);
> + s->block_size = cpu_to_le16(c->sb.block_size);
> + s->bucket_size = cpu_to_le16(c->sb.bucket_size);
> + s->journal_start = cpu_to_le32(c->sb.journal_start);
> + s->first_bucket = cpu_to_le32(c->sb.first_bucket);
> + s->nbuckets = cpu_to_le64(c->sb.nbuckets);
> + s->first_free_bucket = cpu_to_le64(c->sb.first_free_bucket);
> + s->btree_root = cpu_to_le64(c->sb.btree_root);
> + s->btree_level = cpu_to_le16(c->sb.btree_level);
> +
> + submit_bh(WRITE, bh);
> + put_bh(bh);
> +
> + close_bdev_exclusive(c->bdev, FMODE_READ|FMODE_WRITE);
> + module_put(c->owner);
> + kfree(c);
> +}
> +
> +static void register_cache(const char *buffer, size_t size)
> +{
> + char *err = NULL, *path, b[BDEVNAME_SIZE];
> + int i;
> + struct buffer_head *bh = NULL;
> + struct block_device *bdev;
> + struct cache_sb *s;
> + struct cache_device *c;
> + struct page *p;
> +
> + static struct attribute *files[] = {
> + &sysfs_unregister,
> + &sysfs_bucket_size,
> + &sysfs_buckets_used,
> + &sysfs_buckets_free,
> + &sysfs_nbuckets,
> + NULL
> + };
> + const static struct sysfs_ops ops = {
> + .show = show_cache,
> + .store = store_cache
> + };
> + static struct kobj_type cache_obj = {
> + .release = unregister_cache,
> + .sysfs_ops = &ops,
> + .default_attrs = files
> + };
> +
> + if (!try_module_get(THIS_MODULE))
> + return;
> +
> + path = kmalloc(size + 1, GFP_KERNEL);
> + strcpy(path, skip_spaces(buffer));
> +
> + bdev = open_bdev_exclusive(strim(path), FMODE_READ|FMODE_WRITE, NULL);
> + if (IS_ERR(bdev)) {
> + err = "Failed to open cache device";
> + goto err_no_alloc;
> + }
> + set_blocksize(bdev, PAGE_SIZE);
> +
> + p = read_mapping_page_async(bdev->bd_inode->i_mapping, 1, NULL);
> +
> + bh = __bread(bdev, 1, PAGE_SIZE);
> + err = "IO error";
> + if (!bh)
> + goto err_no_alloc;
> + s = (struct cache_sb*) bh->b_data;
> +
> + err = "Insufficient memory";
> + if (!(c = kzalloc(sizeof(struct cache_device), GFP_KERNEL)))
> + goto err_no_alloc;
> +
> + err = "IO error";
> + c->uuids = __bread(bdev, 2, PAGE_SIZE);
> + if (!c->uuids)
> + goto err;
> +
> + err = "Not a bcache superblock";
> + if (memcmp(s->magic, bcache_magic, 16))
> + goto err;
> +
> + c->owner = THIS_MODULE;
> + c->bdev = bdev;
> + c->sb.version = le32_to_cpu(s->version);
> + c->sb.block_size = le16_to_cpu(s->block_size);
> + c->sb.bucket_size = le16_to_cpu(s->bucket_size);
> + c->sb.journal_start = le32_to_cpu(s->journal_start);
> + c->sb.first_bucket = le32_to_cpu(s->first_bucket);
> + c->sb.nbuckets = le64_to_cpu(s->nbuckets);
> + c->sb.first_free_bucket = le64_to_cpu(s->first_free_bucket);
> + c->sb.btree_root = le64_to_cpu(s->btree_root);
> + c->sb.btree_level = le16_to_cpu(s->btree_level);
> +
> + err = "Unsupported superblock version";
> + if (c->sb.version > 0)
> + goto err;
> +
> + // buckets must be multiple of page size, at least for now
> + err = "Bad block/bucket size";
> + if (!c->sb.block_size ||
> + c->sb.bucket_size & 7 ||
> + c->sb.bucket_size < c->sb.block_size)
> + goto err;
> +
> + err = "Invalid superblock: journal overwrites superblock/priorities";
> + if (c->sb.journal_start * c->sb.bucket_size <
> + 24 + (c->sb.nbuckets * sizeof(struct bucket)) / 512)
> + goto err;
> +
> + err = "Invalid superblock";
> + if (c->sb.first_bucket < c->sb.journal_start ||
> + c->sb.first_free_bucket > c->sb.nbuckets ||
> + get_capacity(bdev->bd_disk) < bucket_to_sector(c->sb.nbuckets))
> + goto err;
> +
> + err = "Invalid superblock";
> + if (c->sb.btree_root < c->sb.first_bucket * c->sb.bucket_size ||
> + c->sb.btree_root >= bucket_to_sector(c->sb.first_free_bucket))
> + goto err;
> +
> + c->free_size = 1;
> + while (c->free_size << 6 < c->sb.nbuckets)
> + c->free_size <<= 1;
> +
> + err = "vmalloc error";
> + c->heap = vmalloc(c->sb.nbuckets * sizeof(long));
> + c->buckets = vmalloc(c->sb.nbuckets * sizeof(struct bucket));
> + c->freelist = vmalloc(c->free_size-- * sizeof(long));
> + if (!c->heap || !c->buckets || !c->freelist)
> + goto err;
> +
> + load_priorities(c);
> + put_bh(c->uuids);
> +
> + for (i = 0; i < 256 && devices[i]; i++)
> + register_dev_on_cache(c, i);
> +
> + err = "kobj create error";
> + bdevname(bdev, b);
> + if (!kobject_get(bcache_kobj))
> + goto err;
> +
> + if (kobject_init_and_add(&c->kobj, &cache_obj,
> + bcache_kobj,
> + "%s", b))
> + goto err;
> +
> + list_add(&c->list, &cache_devices);
> +
> + printk(KERN_DEBUG "bcache: Loaded cache device %s\n", path);
> + kfree(path);
> + return;
> +err:
> + if (c->kobj.state_initialized)
> + kobject_put(&c->kobj);
> + if (c->uuids)
> + put_bh(c->uuids);
> + if (c->buckets)
> + vfree(c->buckets);
> + if (c->heap)
> + vfree(c->heap);
> + kfree(c);
> +err_no_alloc:
> + if (bh)
> + put_bh(bh);
> + if (!IS_ERR(bdev))
> + close_bdev_exclusive(bdev, FMODE_READ|FMODE_WRITE);
> + printk(KERN_DEBUG "bcache: error opening %s: %s\n", path, err);
> + kfree(path);
> + return;
> +}
> +
> +static ssize_t store(struct kobject *kobj, struct attribute *attr, const char *buffer, size_t size)
> +{
> + if (attr == &sysfs_register_cache)
> + register_cache(buffer, size);
> + if (attr == &sysfs_register_dev)
> + register_dev(buffer, size);
> + return size;
> +}
> +
> +static int __init bcache_init(void)
> +{
> + const static struct attribute *files[] = { &sysfs_register_cache, &sysfs_register_dev, NULL};
> + const static struct sysfs_ops ops = { .show = NULL, .store = store };
> +
> + printk(KERN_DEBUG "bcache loading\n");
> +
> + bcache_kobj = kobject_create_and_add("bcache", kernel_kobj);
> + if (!bcache_kobj)
> + return -ENOMEM;
> +
> + bcache_kobj->ktype->sysfs_ops = &ops;
> + return sysfs_create_files(bcache_kobj, files);
> +}
> +
> +static void bcache_exit(void)
> +{
> + int i;
> + struct list_head *l;
> +
> + sysfs_remove_file(bcache_kobj, &sysfs_register_cache);
> + sysfs_remove_file(bcache_kobj, &sysfs_register_dev);
> +
> + for (i = 0; i < 256; i++)
> + if (devices[i])
> + devices[i]->bd_cache_fn = NULL;
> +
> + list_for_each(l, &cache_devices)
> + kobject_put(&list_entry(l, struct cache_device, list)->kobj);
> +}
> +
> +module_init(bcache_init);
> +module_exit(bcache_exit);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9fe174d..41b4d21 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1405,7 +1405,7 @@ static inline void __generic_make_request(struct bio *bio)
> {
> struct request_queue *q;
> sector_t old_sector;
> - int ret, nr_sectors = bio_sectors(bio);
> + int ret = 1, nr_sectors = bio_sectors(bio);
> dev_t old_dev;
> int err = -EIO;
>
> @@ -1478,7 +1478,10 @@ static inline void __generic_make_request(struct bio *bio)
>
> trace_block_bio_queue(q, bio);
>
> - ret = q->make_request_fn(q, bio);
> + if (bio->bi_bdev->bd_cache_fn)
> + ret = bio->bi_bdev->bd_cache_fn(q, bio);
> + if (ret)
> + ret = q->make_request_fn(q, bio);
> } while (ret);
>
> return;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 10b8ded..aca254c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -514,6 +514,8 @@ enum positive_aop_returns {
> struct page;
> struct address_space;
> struct writeback_control;
> +struct bio;
> +struct request_queue;
>
> struct iov_iter {
> const struct iovec *iov;
> @@ -664,6 +666,8 @@ struct block_device {
> int bd_invalidated;
> struct gendisk * bd_disk;
> struct list_head bd_list;
> +
> + int (*bd_cache_fn)(struct request_queue *q, struct bio *bio);
> /*
> * Private data. You must have bd_claim'ed the block_device
> * to use this. NOTE: bd_claim allows an owner to claim
>
>
> /* make-bcache.c - initialize a cache device */
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>
> const char bcache_magic[] = { 0xc6, 0x85, 0x73, 0xf6, 0x4e, 0x1a, 0x45, 0xca, 0x82, 0x65, 0xf5, 0x7f, 0x48, 0xba, 0x6d, 0x81 };
>
> struct cache_sb {
> uint8_t magic[16];
> uint32_t version;
> uint16_t block_size; /* sectors */
> uint16_t bucket_size; /* sectors */
> uint32_t journal_start; /* buckets */
> uint32_t first_bucket; /* start of data */
> uint64_t nbuckets; /* device size */
> uint64_t first_free_bucket; /* buckets that have never been used, only increments */
> uint64_t btree_root;
> uint16_t btree_level;
> };
>
> struct bucket_disk {
> uint16_t priority;
> uint8_t generation;
> };
>
> struct btree_node_header {
> uint32_t csum;
> uint32_t nkeys;
> uint64_t random;
> };
>
> char zero[4096];
>
> int main(int argc, char **argv)
> {
> int ret;
> if (argc < 2) {
> printf("Please supply a device\n");
> return 0;
> }
>
> int fd = open(argv[1], O_RDWR);
> if (fd < 0) {
> perror("Can't open dev\n");
> return 0;
> }
>
> struct stat statbuf;
> if (fstat(fd, &statbuf)) {
> perror("stat error\n");
> return 0;
> }
>
> struct cache_sb sb;
> memcpy(sb.magic, bcache_magic, 16);
> sb.version = 0;
> sb.block_size = 8;
> sb.bucket_size = 256;
> sb.nbuckets = statbuf.st_size / (sb.bucket_size * 512);
>
> int priority_pages;
> do {
> priority_pages = --sb.nbuckets / (4096 / sizeof(struct bucket_disk)) + 4;
>
> } while (sb.nbuckets + (priority_pages - 1) / (sb.bucket_size / 8) + 1 >
> statbuf.st_size / (sb.bucket_size * 512));
>
> sb.journal_start = (priority_pages - 1) / (sb.bucket_size / 8) + 1;
> sb.first_bucket = sb.journal_start;
> sb.first_free_bucket = 1;
> sb.btree_root = sb.first_bucket * sb.bucket_size;
> sb.btree_level = 0;
>
> printf("block_size: %u\n"
> "bucket_size: %u\n"
> "journal_start: %u\n"
> "first_bucket: %u\n"
> "nbuckets: %llu\n"
> "first_free_bucket: %llu\n",
> sb.block_size,
> sb.bucket_size,
> sb.journal_start,
> sb.first_bucket,
> sb.nbuckets,
> sb.first_free_bucket);
>
> lseek(fd, 4096, SEEK_SET);
> for (int i = 0; i < priority_pages; i++)
> for (int j = 0; j < 4096; j += ret)
> if ((ret = write(fd, &zero[0], 4096 - j)) < 0)
> goto err;
>
> lseek(fd, 4096, SEEK_SET);
> for (int i = 0; i < sizeof(struct cache_sb); i += ret)
> if ((ret = write(fd, (char *) &sb + i, sizeof(struct cache_sb) - i)) < 0)
> goto err;
>
> struct btree_node_header n;
> n.nkeys = 0;
> n.random = 42;
>
> lseek(fd, sb.bucket_size * 512 * sb.first_bucket, SEEK_SET);
> for (int i = 0; i < sizeof(struct btree_node_header); i += ret)
> if ((ret = write(fd, (char *) &n + i, sizeof(struct btree_node_header) - i)) < 0)
> goto err;
>
> return 0;
> err:
> perror("write error\n");
> }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/