[Ksummit-2013-discuss] [ATTEND] Kernel Logical Volume Management API
Daniel Phillips
d.phillips at partner.samsung.com
Wed Jul 24 08:26:48 UTC 2013
On 07/23/2013 12:50 AM, NeilBrown wrote:
> On Sat, 20 Jul 2013 04:19:41 +0000 Daniel Phillips wrote:
>
>> Hi Niel,
>
> Hi Daneil :-)
Hey, I am not the one on the wrong side of the "i before the e" rule :-)
You will note that I spelled your name correctly in another fork of this
thread; this really was just a keyboard transposition.
Comments inline.
>> Thanks for quick response. Agree on nearly all, except that I doubt
>> there is an existing use case for including filesystems per se in the
>> storage abstraction, unless you are talking about filesystems that can
>> present as block devices. More comments inline.
>
> I see filesystems on both ends of the stack. A filesystem could provide a
> file which gets mapped to a "store" just like the "loop" driver.
> And a filesystem could use a "store" to store data.
OK, violent agreement, but you want to go a step further and provide a
block device view of any file on any filesystem. It is doable, but
deadlock issues would need to be addressed; historically, that has
proved challenging.
> Block devices bother me. When you open /dev/sda what happens is that a
> trivial filesystem is instantiated which presents exactly one file with a
> linear mapping between file address and device address. The one file from
> that filesystem is opened and you are given one 'fd' for that file.
> All IO then goes through the page cache just like it does with most
> filesystems.
A bit of a tangent, but a filesystem may handle metadata IO however it
wants to. A filesystem does not need to use the traditional bdev; it can
just ignore that constellation of support code and do metadata transfers
however it wants. However, it is convenient and efficient to buffer
metadata in a page cache. Tux3 creates its own volume page cache rather
than using the traditional bdev, over which core vfs exercises
inappropriate flushing behaviour that violates our ACID semantics.
> But when a filesystem opens a block device, it gets direct access to the
> "gendisk" underneath and doesn't go through the page cache (though some
> filesystem like ext3 still use that same page cache for metadata, though not
> for data ... which I think is really confusing).
I think you know this, but: files are logically mapped, while the volume
is directly mapped. The page cache itself is unaware of this - the
address_space_operations define the difference, by handling a cache page
index one way or the other.
> A filesystem doesn't really need a "block device" (NFS certainly doesn't!).
> It could use a "store" directly. This could remove some of the potential
> confusion of caches between accessing via the filesystem or accessing
> via /dev/foo.
OK, this is the crux. I presume that a "store" is a kind of block device
in the sense that it supports the bio API. Suppose you wanted to
implement something resembling RaidZ where each extent is its own
stripe, and suppose that the needed checksumming and fan-out/fan-in
transfer logic is encapsulated in a "store" that exposes an additional
API for controlling the precise Raid operations to be performed, bio by
bio. Then the filesystem would be able to avoid implementing much of
that tricky and verbose code, and just take care of what only it knows
how to do: remember which extent is supposed to be stored with which
topology.
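To make the idea concrete, here is a minimal userspace sketch of what such
a "store" interface might look like. Every name in it (store_ops,
raid_hint, linear_store, and so on) is invented for this example, not an
existing kernel API: the point is only that a store implements the
generic submission entry point plus an optional extended op through which
the filesystem dictates the RAID geometry per extent, while the store
keeps the checksumming and fan-out/fan-in logic to itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only; all names are invented for this example. */
struct raid_hint {
    unsigned stripe_width;   /* data members in this extent's stripe */
    unsigned parity;         /* parity members, RaidZ-style, per extent */
};

struct store;

struct store_ops {
    /* base API every store must implement */
    int (*submit)(struct store *s, uint64_t sector, void *buf, size_t len);
    /* extension: set the geometry applied to subsequent submissions */
    int (*set_raid_hint)(struct store *s, const struct raid_hint *h);
};

struct store {
    const struct store_ops *ops;
    struct raid_hint cur;    /* geometry currently in effect */
};

static int linear_submit(struct store *s, uint64_t sector, void *buf,
                         size_t len)
{
    (void)sector; (void)buf; (void)len;
    return (int)s->cur.stripe_width;   /* stand-in for a real transfer */
}

static int linear_set_hint(struct store *s, const struct raid_hint *h)
{
    s->cur = *h;
    return 0;
}

static const struct store_ops linear_ops = {
    .submit = linear_submit,
    .set_raid_hint = linear_set_hint,
};
```

The filesystem only has to remember which extent wants which geometry; it
sets the hint and submits, and the store does the heavy lifting.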
In this way, we could do what Mr Bonwick suggests is impossible:
implement a variable geometry raid technique without reimplementing the
entire volume manager within the filesystem, as ZFS does. We can do it
efficiently too. Whether you telescope or not, the brunt of the work
comes down to bio handling, which can be abstracted.
> I would also allow the store to have a broader interface than block devices.
> Block device have a strictly linear interface. I can well imagine that a
> filesystem could benefit from a 2 (or more) dimensional interface to the
> store.
A "store" should be able to implement any interfaces it chooses, provided
it also implements the bio API and some base management API for stores.
> In particular if a 'store' is to grow, it could grow in number of
> devices, or grow in the size of each device. Trying to allow that while
> keeping stable "block device" addresses requires some arbitrary multiplexing -
> dividing the address up with some bits for one dimension, some bit for the
> other, resulting in a non-contiguous block device. Such a thing could easily
> lead to confusion.
> Having 2-dimensional stores which simply do not present as a block device at
> all would be a real gain I think.
Hmm, that one would need a specific example.
>> On 07/20/2013 02:26 AM, NeilBrown wrote:
>> I would hope that md could fairly easily adopt at least a subset of the
>> new API, to hive off some busy work that is currently through-coded. I
>> was thinking specifically about bio completion nesting here, but there
>> might well be other accessible bits. I thought that we could at least
>> maximize the chance of being able to provide something usable by md by
>> drawing on input from experienced md devs.
>
> I don't know what issue you see with bio completion nesting - it seems quite
> straightforward to me.
Efficiency. Right now, you allocate batches of bios on the way down the
raid stack and free them on completion. The only thing hiding the CPU
overhead of that is the slowness of traditional media. That state of
grace is about to end.
To be sure, the old way can continue to serve, but there is no real
argument for hanging on to it when something more efficient is possible.
As a bonus, the code gets cleaner and more obviously correct.
> For some drivers, the bi_end_io function just finds the parent and calls its
> bi_end_io. For others the parent request would need to be queued for
> handling by some thread. The details of that would probably vary a lot
> between drivers.
I managed to convert the known users pretty easily. To be sure, I recall
at least one bug report the first time round. This time we would
actually have some manpower for the work.
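The efficiency argument can be modeled in userspace. The sketch below
(mbio_* names are invented; this is not the kernel's actual bio API)
shows the shape of stacked completion without per-level allocation: each
bio carries a remaining-count, a child charges its parent at setup time,
and completion is an iterative walk upward rather than a cascade of
allocated-and-freed trampolines.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of allocation-free stacked completion.
 * All names here are invented for illustration. */
struct mbio {
    struct mbio *parent;  /* completion propagates up the raid stack */
    int remaining;        /* this bio plus outstanding children */
    int done;
};

static void mbio_init(struct mbio *b, struct mbio *parent)
{
    b->parent = parent;
    b->remaining = 1;       /* the bio's own completion */
    b->done = 0;
    if (parent)
        parent->remaining++;  /* parent now also waits for this child */
}

static void mbio_endio(struct mbio *b)
{
    /* iterative tail-walk upward instead of nested callbacks */
    while (b && --b->remaining == 0) {
        b->done = 1;
        b = b->parent;
    }
}
```

No allocation on the way down, no free on the way up; on fast media that
is exactly the overhead worth removing.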
In any case, now that you mention it, device management boilerplate
would probably be a lot more of an initial value add for md. I agree
completely: a virtual device ("store") should be completely constructed
and functional before being placed in the device namespace. In some
cases that may not be possible, so other mechanisms would need to be
used, but for any case where it is possible, not completing the setup
work before throwing the device into the fray is just... wrong.
> So I think that to find some commonality in bio completion, you would really
> need to have commonality in request queuing and thread management. Maybe
> that is a good thing, I don't know.
Thread management?
> Is there really any value in "bumpless"? How often do you reconfigure and
> how big is the bump? Certainly when I'm converting a RAID1 to a RAID5 (for
> example) I want a clean break. Having requests in-flight for both at the
> same time is way more complexity than I would be interested in. But maybe
> you have a different use case which is more compelling.
Currently, dm will suspend a device in order to modify it. Sometimes the
modification is very frequent and regular, like repeatedly moving a
mirror along a pair of devices in order to copy one to the other by
mirroring, to effect a "physical device move". (Arguably the best trick
that lvm2 does.) During this process the aggregate device is always
live, except for short suspends during the mirror moves. While a device
is suspended, dm just queues up IO requests so that the caller can
continue to submit its async IO storm if that is what it is doing. This
queue can get big and eat memory, and unblocking the queue can create a
thundering herd.
Alasdair is more familiar with the issues that came up in practice,
which may indeed all be addressed by now. But given that in most cases
we can actually just handle IO during the transition by thinking about
what maps to what, why shouldn't we?
(If you are changing the device geometry in such a way that no stable
mapping is preserved then the block device probably should not be
handling application IO at all because any filesystem stored on it will
be destroyed.)
To be sure, a reliable suspend/resume mechanism for aggregate block
devices is needed anyway, so that is base functionality. Bumpless
operation is a nicety that could be done down the road. I want it, just
because I hate bumps, and it really doesn't seem harder than a lot of
things we do on the fly. Certainly easier than memory hotplug.
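The "thinking about what maps to what" can be modeled concretely. Here is
a userspace sketch of the stable mapping during a pvmove-style mirror
move (all names invented for illustration): sectors below the progress
pointer have already been copied, so reads go to the new device; sectors
at or above it still live on the old one; and writes to the copied region
hit both legs to keep them coherent. Because every sector has a
well-defined home at every instant, IO can keep flowing instead of being
suspended and queued at each mirror move.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of a bumpless mirror move; names are illustrative. */
struct move_state {
    uint64_t progress;    /* first sector not yet copied */
};

enum target { OLD_DEV, NEW_DEV };

static enum target map_read(const struct move_state *m, uint64_t sector)
{
    /* copied region reads from the new device, the rest from the old */
    return sector < m->progress ? NEW_DEV : OLD_DEV;
}

static void map_write(const struct move_state *m, uint64_t sector,
                      int *to_old, int *to_new)
{
    /* writes below progress must hit both copies so the already-copied
     * region stays coherent; writes above it will be copied later */
    *to_new = sector < m->progress;
    *to_old = 1;
}
```

Nothing here requires suspending the device; the copier just advances
`progress` region by region while IO is mapped on the fly.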
> If I remember correctly, a dm block device contains a "table", a "table"
> contains one or more "targets". A "target" can contain one or more "block
> devices". So there are 3 levels before any recursion is possible. I'm hoping
> for 2, or possibly 1.
Right, too many levels. Also, every dm device is actually a
concatenation of targets, whereas a naive person would think that if you
want a concatenation, then that would be a particular kind of target. In
other words, concatenating devices should not be core functionality of
the abstraction; it should be a specialization. Or in other other words,
why do we need that mechanism on the hot path of every virtual device,
whether used or not?
> So "volume" would be a bit like "dm-target", but would include the
> functionality of "dm-table" and would be a bit like "mddev" as well.
Ack.
>>> An important way that this is different to md (and I think dm though I'm a
>>> bit foggy on the details there) is that with 'md' the first thing you create
>>> is the block device. Then you configure it and activate it and suddenly the
>>> block device has content. This doesn't fit at all well with udev and it
>>> isn't really udev's fault. It is md's fault.
>>
>> Could you dive deeper into that one, i.e., describe the sequence of
>> operations that bothers udev? As it happens, device mapper likes to
>> create targets separately from block devices. Maybe there is something
>> fundamentally right about that.
>
> When you create a block device with add_disk() it calls register_disk() which
> generate KOBJ_ADD which goes to udev. Udev thinks "oh, a new block device,
> better see what is inside it" etc. But for both dm and md the "inside"
> doesn't exist yet. A little while later when things are set up a KOBJ_CHANGE
> event has to be generated so udev goes looking again.
> This works (because we had to make it work) but is rather ugly.
> It would be much better if the data was available the moment that KOBJ_ADD was
> generated.
>
> KOBJ_CHANGE is really for devices with removable media. Modelling logical
> volume management as "removable media" doesn't seem right.
OK, thanks. Yes, certainly a flaw and a correctable one.
> You might be able to create dm-targets without the block device (I'm not
> sure), but you cannot create the dm-table.
> "disk_add()" is called when a "mapped_device" is created, and to load a
> table, you need to have a mapped_device.
I seem to recall that the way lvm2 resolves this is to keep the device
suspended until the table load is done. I would far rather that the
device not become visible until it is ready to act like a device. Maybe
Alasdair can comment, CC added.
> I would prefer the table/target/volume/mddev were all assembled first, then
> in a single atomic operation a gendisk linked to that thing were given to
> add_disk to create the block device and make it visible to udev and everybody
> else.
Right, 20/20 hindsight.
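The assemble-first discipline is easy to state in code. Below is a tiny
userspace model (all names invented): a device enters the namespace only
through a single publish step, which refuses anything not fully
configured, so the KOBJ_ADD moment and the "content exists" moment would
coincide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of assemble-then-publish; names are invented for illustration. */
struct vdev {
    const char *name;
    uint64_t sectors;    /* capacity must be known before publishing */
};

#define MAX_DEVS 8
static struct vdev *namespace_tbl[MAX_DEVS];
static int ndevs;

static int publish(struct vdev *d)
{
    if (!d->name || d->sectors == 0)
        return -1;                 /* half-built: stays invisible */
    if (ndevs == MAX_DEVS)
        return -1;
    namespace_tbl[ndevs++] = d;    /* the single visible step */
    return 0;
}
```

A lookup against the namespace can then never observe a half-built
device, which is exactly the property udev wants.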
Well, KS is already providing a useful forum to kick some ideas around
and you are involved whether attending or not. I really think we need to
build some momentum on this earlier rather than later, and keep that
going through the other conferences. It doesn't really matter whether
absolutely every interested party is there on the ground; what matters
is whether a critical mass is present, which seems more than likely.
Regards,
Daniel