[Ksummit-2013-discuss] [ATTEND] Kernel Logical Volume Management API

Daniel Phillips d.phillips at partner.samsung.com
Wed Jul 24 08:26:48 UTC 2013


On 07/23/2013 12:50 AM, NeilBrown wrote:
> On Sat, 20 Jul 2013 04:19:41 +0000 Daniel Phillips wrote:
>
>> Hi Niel,
>
> Hi Daneil :-)

Hey, I am not the one on the wrong side of the "i before the e" rule :-)
You will note that I spelled your name correctly in another fork of this 
thread; this really was just a keyboard transposition here.

Comments inline.

>> Thanks for the quick response. Agree on nearly all, except that I doubt
>> there is an existing use case for including filesystems per se in the
>> storage abstraction, unless you are talking about filesystems that can
>> present as block devices. More comments inline.
>
> I see filesystems on both ends of the stack.  A filesystem could provide a
> file which gets mapped to a "store" just like the "loop" driver.
> And a filesystem could use a "store" to store data.

OK, violent agreement, but you want to go a step further and provide a 
block device view of any files on any filesystem. It is doable; however, 
deadlock issues would need to be addressed, and historically that has 
proved challenging.

> Block devices bother me.  When you open /dev/sda what happens is that a
> trivial filesystem is instantiated which presents exactly one file with a
> linear mapping between file address and device address.  The one file from
> that filesystem is opened and you are given an 'fd' for that file.
> All IO then goes through the page cache just like it does with most
> filesystems.

A bit of a tangent, but a filesystem may handle metadata IO however it 
wants to. It does not need to use the traditional bdev; it can simply 
ignore that constellation of support code and do its own metadata 
transfers. That said, it is convenient and efficient to buffer metadata 
in a page cache. Tux3 creates its own volume page cache rather than 
using the traditional bdev, over which the core VFS exercises 
inappropriate flushing behaviour that violates our ACID semantics.

> But when a filesystem opens a block device, it gets direct access to the
> "gendisk" underneath and doesn't go through the page cache (though some
> filesystem like ext3 still use that same page cache for metadata, though not
> for data ... which I think is really confusing).

I think you know this, but: files are logically mapped, while the 
volume is directly mapped. The page cache itself is unaware of the 
distinction - the address_space_operations define the difference by 
handling a cache page index one way or the other.
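
In get_block terms, which both kinds of cache can be built on, the 
difference is just whether the page index gets translated. A minimal 
sketch, not Tux3's actual code; fs_extent_lookup() is a hypothetical 
stand-in for whatever block mapping the filesystem really does:

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Hypothetical helper standing in for the filesystem's real block
 * mapping (extent tree lookup, allocation, whatever it happens to be). */
sector_t fs_extent_lookup(struct inode *inode, sector_t iblock, int create);

/* Logically mapped: a file page index means nothing on disk until the
 * filesystem's mapping has been consulted. */
static int file_get_block(struct inode *inode, sector_t iblock,
			  struct buffer_head *bh, int create)
{
	map_bh(bh, inode->i_sb, fs_extent_lookup(inode, iblock, create));
	return 0;
}

/* Directly mapped: a volume page index already is the device address. */
static int volume_get_block(struct inode *inode, sector_t iblock,
			    struct buffer_head *bh, int create)
{
	map_bh(bh, inode->i_sb, iblock);
	return 0;
}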

> A filesystem doesn't really need a "block device" (NFS certainly doesn't!).
> It could use a "store" directly.  This could remove some of the potential
> confusion of caches between accessing via the filesystem or accessing
> via /dev/foo.

OK, this is the crux. I presume that a "store" is a kind of block device 
in the sense that it supports the bio API. Suppose you wanted to 
implement something resembling RAID-Z, where each extent is its own 
stripe, and suppose that the needed checksumming and fan-out/fan-in 
transfer logic is encapsulated in a "store" that exposes an additional 
API for controlling the precise RAID operations to be performed, bio by 
bio. Then the filesystem would be able to avoid implementing much of 
that tricky and verbose code, and just take care of what only it knows 
how to do: remember which extent is supposed to be stored with which 
topology.
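
To make that concrete, here is the rough shape of the interface I am 
imagining; struct store, store_ops and store_geometry are all names 
invented for illustration, not a proposal for the actual API:

#include <linux/bio.h>

/* Sketch only: every name here is invented for illustration. */
struct store;

struct store_geometry {
	unsigned data_devs;	/* fan-out for this extent's stripe */
	unsigned parity_devs;	/* 1 for RAID5-like, 2 for RAID6-like, ... */
	unsigned chunk_shift;	/* log2 of the chunk size within the stripe */
};

struct store_ops {
	/* Base interface: every store accepts ordinary bios. */
	void (*submit)(struct store *store, struct bio *bio);

	/* Optional extension: the filesystem supplies per-extent geometry
	 * and the store does the checksumming and fan-out/fan-in work,
	 * so the filesystem only has to remember which extent was written
	 * with which topology. */
	void (*submit_striped)(struct store *store, struct bio *bio,
			       struct store_geometry *geo);
};

The filesystem's side of the bargain then reduces to recording something 
like a store_geometry per extent in its own metadata.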

In this way, we could do what Mr Bonwick suggests is impossible: 
implement a variable-geometry RAID technique without reimplementing the 
entire volume manager inside the filesystem, as ZFS does. We can do it 
efficiently too. Whether you telescope or not, the brunt of the work 
comes down to bio handling, which can be abstracted.

> I would also allow the store to have a broader interface than block devices.
> Block devices have a strictly linear interface.  I can well imagine that a
> filesystem could benefit from a 2 (or more) dimensional interface to the
> store.

A "store" should be able to implement any interfaces it choses, provided 
it also implements the bio API and some base management API for stores.
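
For example, the base management side might be no more than something 
like this - again a hypothetical sketch with invented names, not a 
proposed interface:

#include <linux/types.h>

struct store;	/* as in the earlier sketch: an invented name */

/* Sketch only: the minimal management interface every store implements,
 * whatever else it chooses to expose. */
struct store_mgmt_ops {
	int  (*configure)(struct store *store, const char *params);
	int  (*resize)(struct store *store, sector_t new_sectors);
	int  (*suspend)(struct store *store);	/* quiesce, start queuing IO */
	int  (*resume)(struct store *store);	/* release anything queued */
	void (*destroy)(struct store *store);
};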

> In particular if a 'store' is to grow, it could grow in number of
> devices, or grow in the size of each device.  Trying to allow that while
> keeping stable "block device" addresses requires some arbitrary multiplexing -
> dividing the address up with some bits for one dimension, some bits for the
> other, resulting in a non-contiguous block device.  Such a thing could easily
> lead to confusion.
> Having 2-dimensional stores which simply do not present as a block device at
> all would be a real gain I think.

Hmm, that one would need a specific example.

>> On 07/20/2013 02:26 AM, NeilBrown wrote:
>> I would hope that md could fairly easily adopt at least a subset of the
>> new API, to hive off some busy work that is currently through-coded. I
>> was thinking specifically about bio completion nesting here, but there
>> might well be other accessible bits. I thought that we could at least
>> maximize the chance of being able to provide something usable by md by
>> drawing on input from experienced md devs.
>
> I don't know what issue you see with bio completion nesting - it seems quite
> straightforward to me.

Efficiency. Right now, you allocate batches of bios on the way down the 
RAID stack and free them on completion. The only thing hiding the CPU 
overhead of that is the slowness of traditional media. That state of 
grace is about to end.

To be sure, the old way can continue to serve, but there is no real 
argument for hanging on to it when something more efficient is possible. 
As a bonus, the code gets cleaner and more obviously correct.
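
A minimal sketch of the kind of completion nesting helper I mean, 
assuming the 3.x era bio_endio(bio, error) signature; nested_completion 
is an invented structure that would live in per-request state the 
stacking driver already has, so nothing extra is allocated or freed per 
level:

#include <linux/bio.h>
#include <linux/atomic.h>

/* Sketch only: complete a parent bio once all of its children are done,
 * by chaining bi_end_io instead of bouncing through per-level helpers. */
struct nested_completion {
	struct bio *parent;
	atomic_t pending;	/* one count per outstanding child */
	int error;		/* naive error aggregation, fine for a sketch */
};

static void nested_child_endio(struct bio *child, int error)
{
	struct nested_completion *nc = child->bi_private;

	if (error)
		nc->error = error;
	bio_put(child);
	if (atomic_dec_and_test(&nc->pending))
		bio_endio(nc->parent, nc->error);
}

/* On the way down, the driver points each child at the shared state: */
static void nested_add_child(struct nested_completion *nc, struct bio *child)
{
	atomic_inc(&nc->pending);
	child->bi_private = nc;
	child->bi_end_io = nested_child_endio;
}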

> For some drivers, the bi_end_io function just finds the parent and calls its
> bi_end_io.  For others the parent request would need to be queued for
> handling by some thread. The details of that would probably vary a lot
> between drivers.

I managed to convert the known users pretty easily. To be sure, I recall 
at least one bug report the first time round. This time we would 
actually have some manpower for the work.

In any case, now that you mention it, device management boilerplate 
would probably be a lot more of an initial value add for md. I agree 
completely: a virtual device ("store") should be completely constructed 
and functional before being placed in the device namespace. In some 
cases that may not be possible, so other mechanisms would need to be 
used, but for any case where it is possible, not completing the setup 
work before throwing the device into the fray is just... wrong.

> So I think that to find some commonality in bio completion, you would really
> need to have commonality in request queuing and thread management.  Maybe
> that is a good thing, I don't know.

Thread management?

> Is there really any value in "bumpless"?  How often do you reconfigure and
> how big is the bump?  Certainly when I'm converting a RAID1 to a RAID5 (for
> example) I want a clean break.  Having requests in-flight for both at the
> same time is way more complexity than I would be interested in.  But maybe
> you have a different use case which is more compelling.

Currently, dm will suspend a device in order to modify it. Sometimes the 
modification is very frequent and regular, like repeatedly moving a 
mirror along a pair of devices in order to copy one to the other by 
mirroring, to effect a "physical device move". (Arguably the best trick 
that lvm2 does.) During this process the aggregate device is always 
live, except for short suspends during the mirror moves. While a device 
is suspended, dm just queues up IO requests so that the caller can 
continue to submit its async IO storm if that is what it is doing. This 
queue can get big and eat memory, and unblocking the queue can create a 
thundering herd.
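
Roughly what that queueing looks like, sketched with the kernel's 
bio_list helpers; struct suspendable_dev and remap_and_submit() are 
invented names, this is not dm's actual code:

#include <linux/bio.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Sketch only: while suspended, incoming bios pile up on a list;
 * resume releases them all at once, hence the thundering herd. */
struct suspendable_dev {
	spinlock_t lock;
	bool suspended;
	struct bio_list deferred;
};

/* Hypothetical: apply the current mapping and pass the bio down. */
void remap_and_submit(struct suspendable_dev *sdev, struct bio *bio);

static void sdev_submit(struct suspendable_dev *sdev, struct bio *bio)
{
	spin_lock(&sdev->lock);
	if (sdev->suspended) {
		bio_list_add(&sdev->deferred, bio);	/* grows unboundedly */
		spin_unlock(&sdev->lock);
		return;
	}
	spin_unlock(&sdev->lock);
	remap_and_submit(sdev, bio);
}

static void sdev_resume(struct suspendable_dev *sdev)
{
	struct bio_list flush;
	struct bio *bio;

	bio_list_init(&flush);
	spin_lock(&sdev->lock);
	sdev->suspended = false;
	bio_list_merge(&flush, &sdev->deferred);
	bio_list_init(&sdev->deferred);
	spin_unlock(&sdev->lock);

	while ((bio = bio_list_pop(&flush)))
		remap_and_submit(sdev, bio);	/* the herd thunders here */
}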

Alasdair is more familiar with the issues that came up in practice, 
which may indeed all be addressed by now. But given that in most cases 
we can actually just handle IO during the transition by thinking about 
what maps to what, why shouldn't we?

(If you are changing the device geometry in such a way that no stable 
mapping is preserved then the block device probably should not be 
handling application IO at all because any filesystem stored on it will 
be destroyed.)

To be sure, a reliable suspend/resume mechanism for aggregate block 
devices is needed anyway, so that is base functionality. Bumpless 
operation is a nicety that could be done down the road. I want it, just 
because I hate bumps, and it really doesn't seem harder than a lot of 
things we do on the fly. Certainly easier than memory hotplug.

> If I remember correctly, a dm block device contains a "table", a "table"
> contains one or more "targets".  A "target" can contain one or more "block
> devices".  So there are 3 levels before any recursion is possible.  I'm hoping
> for 2, or possibly 1.

Right, too many levels. Also, every dm device is actually a 
concatenation of targets, whereas a naive person would think that if you 
want a concatenation, then that would be a particular kind of target. In 
other words, concatenating devices should not be core functionality of 
the abstraction; it should be a specialization. Or in other other words, 
why do we need that mechanism on the hot path of every virtual device, 
whether used or not?
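
In sketch form, a concatenation is then just one more store 
specialization with a trivial mapping, instead of table machinery every 
device drags along. All the store_* names are invented, and I am 
assuming the 3.x bio layout (bi_sector); splitting a bio that straddles 
a member boundary is glossed over:

#include <linux/bio.h>
#include <linux/kernel.h>
#include <linux/errno.h>

struct store;					/* invented, as above */
void store_submit(struct store *store, struct bio *bio);	/* hypothetical */

/* Sketch only: concatenation as an ordinary store specialization. */
struct concat_member {
	struct store *store;
	sector_t start;		/* offset of this member in the concatenation */
	sector_t sectors;
};

struct concat_store {
	struct store *base;		/* hypothetical common store handle */
	unsigned nr_members;		/* members sorted by start, from 0 */
	struct concat_member members[];
};

static void concat_submit(struct concat_store *c, struct bio *bio)
{
	sector_t pos = bio->bi_sector;
	unsigned i;

	for (i = 0; i < c->nr_members; i++) {
		struct concat_member *m = &c->members[i];

		if (pos < m->start + m->sectors) {
			bio->bi_sector = pos - m->start;
			store_submit(m->store, bio);
			return;
		}
	}
	bio_endio(bio, -EIO);	/* past the end of the concatenation */
}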

> So "volume" would be a bit like "dm-target", but would include the
> functionality of "dm-table" and would be a bit like "mddev" as well.

Ack.

>>> An important way that this is different to md (and I think dm though I'm a
>>> bit foggy on the details there) is that with 'md' the first thing you create
>>> is the block device.  Then you configure it and activate it and suddenly the
>>> block device has content.  This doesn't fit at all well with udev and it
>>> isn't really udev's fault.  It is md's fault.
>>
>> Could you dive deeper into that one, i.e., describe the sequence of
>> operations that bothers udev? As it happens, device mapper likes to
>> create targets separately from block devices. Maybe there is something
>> fundamentally right about that.
>
> When you create a block device with add_disk() it calls register_disk() which
> generates KOBJ_ADD which goes to udev. Udev thinks "oh, a new block device,
> better see what is inside it" etc.  But for both dm and md the "inside"
> doesn't exist yet.  A little while later when things are set up a KOBJ_CHANGE
> event has to be generated so udev goes looking again.
> This works (because we had to make it work) but is rather ugly.
> It would be much better if the data was available the moment that KOBJ_ADD was
> generated.
>
> KOBJ_CHANGE is really for devices with removable media.  Modelling logical
> volume management as "removable media" doesn't seem right.

OK, thanks. Yes, certainly a flaw and a correctable one.

> You might be able to create dm-targets without the block device (I'm not
> sure), but you cannot create the dm-table.
> "disk_add()" is called when a "mapped_device" is created, and to load a
> table, you need to have a mapped_device.

I seem to recall that the way lvm2 resolves this is to keep the device 
suspended until the table load is done. I would far rather that the 
device not become visible until it is ready to act like a device. Maybe 
Alasdair can comment, CC added.

> I would prefer the table/target/volume/mddev were all assembled first, then
> in a single atomic operation a gendisk linked to that thing were given to
> add_disk to create the block device and make it visible to udev and everybody
> else.

Right, 20/20 hindsight.
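
In outline, the sequence we both seem to want might read like this; the 
store_* calls are invented names standing in for whatever the real 
assembly API would be, while add_disk() is the real publication point 
that generates KOBJ_ADD:

#include <linux/genhd.h>
#include <linux/err.h>

/* Sketch only: assemble everything first, publish once, atomically. */
struct store;
struct store *store_assemble(const char *name);		/* hypothetical */
int store_attach_members(struct store *store);		/* hypothetical */
struct gendisk *store_to_gendisk(struct store *store);	/* hypothetical */
void store_destroy(struct store *store);		/* hypothetical */

static int store_publish(const char *name)
{
	struct store *store = store_assemble(name);	/* fully configured */
	struct gendisk *disk;
	int err;

	if (IS_ERR(store))
		return PTR_ERR(store);

	err = store_attach_members(store);	/* members, geometry, the lot */
	if (err)
		goto fail;

	disk = store_to_gendisk(store);		/* gendisk bound to a live store */
	if (IS_ERR(disk)) {
		err = PTR_ERR(disk);
		goto fail;
	}

	add_disk(disk);		/* single KOBJ_ADD, contents already readable */
	return 0;
fail:
	store_destroy(store);
	return err;
}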

Well, KS is already providing a useful forum to kick some ideas around, 
and you are involved whether attending or not. I really think we need to 
build some momentum on this sooner rather than later, and keep it going 
through the other conferences. It doesn't really matter whether 
absolutely every interested party is there on the ground; what matters 
is whether a critical mass is present, which seems more than likely.

Regards,

Daniel

