Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

From: Jared Hulbert
Date: Tue Feb 02 2016 - 03:05:14 EST


On Mon, Feb 1, 2016 at 10:46 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>> On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>>>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>>>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>>> > > I guess I need to go off and understand if we can have DAX mappings on such a
>>>> > > device. If we can, we may have a problem - we can get the block_device from
>>>> > > get_block() in I/O path and the various fault paths, but we don't have access
>>>> > > to get_block() when flushing via dax_writeback_mapping_range(). We avoid
>>>> > > needing it the normal case by storing the sector results from get_block() in
>>>> > > the radix tree.
>>>> >
>>>> > I think we're doing it wrong by storing the sector in the radix tree; we'd
>>>> > really need to store both the sector and the bdev which is too much data.
>>>> >
>>>> > If we store the PFN of the underlying page instead, we don't have this
>>>> > problem. Instead, we have a different problem; of the device going
>>>> > away under us. I'm trying to find the code which tears down PTEs when
>>>> > the device goes away, and I'm not seeing it. What do we do about user
>>>> > mappings of the device?
>>>>
>>>> So I don't have a strong opinion whether storing PFN or sector is better.
>>>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special
>>>> cases like inodes on XFS RT devices would be IMHO fine.
>>>
>>> We need to support alternate devices.
>>
>> Embedded devices trying to use NOR Flash to free up RAM was
>> historically one of the more prevalent real world uses of the old
>> filemap_xip.c code although the users never made it to mainline. So I
>> spent some time last week trying to figure out how to make a subset of
>> DAX not depend on CONFIG_BLOCK. It was a very frustrating and
>> unfruitful experience. I discarded my main conclusion as impractical,
>> but now that I see the difficulty DAX faces in dealing with
>> "alternate devices" especially some of the crazy stuff btrfs can do, I
>> wonder if it's not so crazy after all.
>>
>> Let's stop calling bdev_direct_access() directly from DAX. Let the
>> filesystems do it.
>>
>> Sure we could enable generic_dax_direct_access() helper for the
>> filesystems that only support single devices to make it easy. But XFS
>> and btrfs for example, have to do the work of figuring out what bdev
>> is required and then calling bdev_direct_access().
>>
>> My reasoning is that the filesystem knows how to map inodes and
>> offsets to devices and sectors, no matter how complex that is. It
>> would even enable a filesystem to intelligently use a mix of
>> direct_access and regular block devices down the road. Of course it
>> would also make the block-less solution doable.
>>
>> Good idea? Stupid idea?
>
> The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems
> the CONFIG_BLOCK=n is an incremental feature in its own right. What
> driver and what filesystem are looking to enable this XIP support in?

Well... CONFIG_BLOCK was not required with filemap_xip.c for a
decade, so from a certain point of view this CONFIG_BLOCK dependency
is itself the result of an incremental feature ;)

The obvious 'driver' is physical RAM without a particular driver.
Please remember I'm talking about embedded: RAM measured in MiB,
funky one-off hardware, etc. In the embedded world persistent memory
has been supported in lots of device-specific ways without the new
fancypants NFIT and Intel instructions, so frankly those setups
don't fit in the PMEM stuff. Maybe they could be supported in
PMEM, but not without effort to bring embedded players to the table.

The other drivers are the MTD drivers, probably as read-only for now.
But the paradigm there isn't so different from what PMEM looks like
with asymmetric read/write capabilities.

The filesystem I'm concerned with is AXFS
(https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf),
which I've been planning on trying to merge again due to a recent
resurgence of interest. The device model for AXFS is... weird. It
can use one or two devices at a time, in any mix of NOR MTD, NAND MTD,
block, and unmanaged physical memory. It's a terribly useful model
for embedded. Anyway, AXFS is read-only, so hacking in a read-only
dax_fault_nodev() and dax_file_read() would work fine and looks easy
enough. But... it would be cool if similar small embedded-focused RW
filesystems were enabled.

I don't expect you to taint DAX with design requirements for this
stuff that it wasn't built for; nobody ends up happy in that case.
However, if enabling the filesystem to manage the bdev_direct_access()
interactions solves some of the "alternate device" problems you are
discussing here, then there is a chance we can accommodate both.
Sometimes that works.

So... forget CONFIG_BLOCK=n entirely; I didn't want that to be the
focus anyway. Would letting the filesystem handle the
bdev_direct_access() calls help support the weirder XFS and btrfs
device models?
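
For illustration only, the split being asked about could be modelled
roughly as below. This is a user-space toy, not kernel code: every name
in it (model_dev, model_inode, generic_direct_access,
multidev_direct_access, model_dax_lookup) is hypothetical, and
dev_direct_access() merely stands in for the role bdev_direct_access()
plays. The assumption is that the generic DAX core only ever calls a
per-filesystem hook and never picks a device itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a directly addressable device (hypothetical). */
struct model_dev {
	uint8_t *base;   /* start of the mappable memory */
	size_t   size;   /* size in bytes */
};

struct model_inode;

/* Hypothetical per-filesystem hook: (inode, offset) -> direct address. */
typedef void *(*fs_direct_access_t)(struct model_inode *inode,
				    size_t off, size_t *avail);

struct model_inode {
	struct model_dev  *data_dev;  /* primary device */
	struct model_dev  *alt_dev;   /* alternate device, may be NULL */
	size_t             split;     /* toy layout: off >= split is on alt_dev */
	fs_direct_access_t direct_access;
};

/* Stand-in for the device-level call; returns NULL when out of range. */
static void *dev_direct_access(struct model_dev *dev, size_t off,
			       size_t *avail)
{
	if (dev == NULL || off >= dev->size)
		return NULL;
	*avail = dev->size - off;
	return dev->base + off;
}

/* Helper for filesystems that only ever use a single device. */
void *generic_direct_access(struct model_inode *inode, size_t off,
			    size_t *avail)
{
	return dev_direct_access(inode->data_dev, off, avail);
}

/* A multi-device filesystem picks the device itself. */
void *multidev_direct_access(struct model_inode *inode, size_t off,
			     size_t *avail)
{
	if (off < inode->split) {
		void *p = dev_direct_access(inode->data_dev, off, avail);

		/* Do not report bytes past the point where the file moves on. */
		if (p != NULL && *avail > inode->split - off)
			*avail = inode->split - off;
		return p;
	}
	return dev_direct_access(inode->alt_dev, off - inode->split, avail);
}

/* Generic core: only calls the hook, never touches a device directly. */
void *model_dax_lookup(struct model_inode *inode, size_t off, size_t *avail)
{
	assert(inode != NULL && inode->direct_access != NULL);
	return inode->direct_access(inode, off, avail);
}
```

In this model a single-device filesystem would set the hook to
generic_direct_access(), while one with an alternate device would
supply its own. A block-less filesystem could in principle provide a
hook that never goes near a block device at all.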