[patch 0/4] [RFC] Another proportional weight IO controller

Tue Nov 25 22:40:18 PST 2008

On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote:
> > The dm approach has some merits, the major one being that it'll fit
> > directly into existing setups that use dm and can be controlled with
> > familiar tools. That is a bonus. The drawback is partially the same -
> > it'll require dm. So it's still not a fit-all approach, unfortunately.
> > 
> > So I'd prefer an approach that doesn't force you to use dm.
> 
> Hi Jens,
> 
> My patches met the goal of not using the dm for every device one wants
> to control.
> 
> Having said that, few things come to mind.
> 
> - In what cases do we need to control the higher level logical devices
>   like dm. It looks like real contention for resources is at leaf nodes.
>   Hence any kind of resource management/fair queueing should probably be
>   done at leaf nodes and not at higher level logical nodes.

The problem with stacking devices is that we do not know how the IO
going through the leaf nodes contributes to the aggregate throughput
seen by the application/cgroup that generated it, which is what end
users care about.

The block device could be a plain old SATA device, a loop device, a
stacking device, an SSD, you name it, but their topologies and the fact
that some of them do not even use an elevator should be transparent to
the user.

If you wanted to do resource management at the leaf nodes, some kind of
topology information would have to be passed down to the elevators
controlling the underlying devices, which in turn would need to work
cooperatively.
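
To illustrate why per-leaf fairness need not translate into fairness in
aggregate throughput, here is a toy calculation. It assumes equal
weights, that every cgroup is always backlogged on each leaf device it
touches, and that each leaf splits its capacity equally among its users.
The function name and the topology (cgroup A on a device striped over
disk1 and disk2, cgroup B on disk1 only) are made up for the example.

```python
def aggregate_shares(leaf_users, capacity=1.0):
    """Sum, per cgroup, the share it receives on every leaf device.

    leaf_users maps a leaf device name to the cgroups issuing IO to it.
    Assumption: each leaf divides its capacity equally among its users
    and knows nothing about the other leaves.
    """
    totals = {}
    for _device, cgroups in leaf_users.items():
        for cgroup in cgroups:
            totals[cgroup] = totals.get(cgroup, 0.0) + capacity / len(cgroups)
    return totals

# Hypothetical topology: A is striped over both disks, B uses disk1 only.
leaf_users = {"disk1": ["A", "B"], "disk2": ["A"]}
print(aggregate_shares(leaf_users))
```

Each leaf is perfectly fair in isolation, yet A ends up with three times
the aggregate bandwidth of B despite the equal weights.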

>   If that makes sense, then probably we don't need to control dm device
>   and we don't need such higher level solutions.

For the reasons stated above, the two-level scheduling approach seems
cleaner to me.

> - Any kind of 2 level scheduler solution has the potential to break the
>   underlying IO scheduler. Higher level solution requires buffering of
>   bios and controlled release of bios to lower layers. This control breaks
>   the assumptions of lower layer IO scheduler which knows in what order
>   bios should be dispatched to device to meet the semantics exported by
>   the IO scheduler.

Please note that such an IO controller would only get in the way of the
elevator when there is contention for the device. What is more,
depending on the workload, it turns out that buffering at higher layers
on a per-cgroup or per-task basis, like dm-band does, may actually
increase the aggregate throughput (I think that the dm-band team
observed this behavior too). The reason seems to be that bios buffered
in such a way tend to be highly correlated and thus very likely to get
merged when released to the elevator.
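
The merging effect can be sketched with a toy userspace model. This is
not how dm-band or any real elevator works: it assumes, for simplicity,
that a bio can only be back-merged into the bio dispatched immediately
before it, and all names are made up for the example.

```python
from collections import defaultdict

def count_requests(bios):
    """Count requests left after back-merging adjacent bios.

    Each bio is (cgroup, start_sector, nr_sectors). Toy model: a bio is
    merged only if it starts exactly where the previous bio ended.
    """
    requests = 0
    prev_end = None
    for _cgroup, start, length in bios:
        if start != prev_end:
            requests += 1
        prev_end = start + length
    return requests

def release_per_cgroup(bios):
    """Buffer bios per cgroup, then release each cgroup's bios as a batch."""
    buffered = defaultdict(list)
    for bio in bios:
        buffered[bio[0]].append(bio)
    return [bio for group in buffered.values() for bio in group]

# Two cgroups doing sequential IO in different regions, arriving interleaved.
arrivals = []
for i in range(4):
    arrivals.append(("A", 0 + 8 * i, 8))
    arrivals.append(("B", 1000 + 8 * i, 8))

print(count_requests(arrivals))                      # arrival order
print(count_requests(release_per_cgroup(arrivals)))  # per-cgroup batches
```

In this contrived case the interleaved stream yields eight separate
requests, while the per-cgroup batches collapse into two.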

> - 2nd level scheduler does not keep track of tasks but task groups lets
>   every group dispatch fair share. This has got little semantic problem in
>   the sense that tasks and groups in root cgroup will not be considered at
>   same level. "root" will be considered one group at same level with all
>   child group hence competing with them for resources.
> 
>   This looks little odd. Considering tasks and groups same level kind of
>   makes more sense. cpu scheduler also considers tasks and groups at same
>   level and deviation from that probably is not very good.
> 
>   Considering tasks and groups at same level will matter only if IO
>   scheduler maintains separate queue for the task, like CFQ. Because
>   in that case IO scheduler tries to provide fairness among various task
>   queues. Some schedulers like noop don't have any notion of separate
>   task queues and fairness among them. In that case probably we don't
>   have a choice but to assume root group competing with child groups.

If deemed necessary, this case could be handled too, but it does not
look like a show-stopper.

> Keeping above points in mind, probably two level scheduling is not a
> very good idea. If putting the code in a particular IO scheduler is a
> concern we can probably explore ways regarding how we can maximize the
> sharing of cgroup code among IO schedulers.

As discussed above, I still think that the two-level scheduling approach
makes more sense. Regarding the sharing of cgroup code among IO
schedulers, I am all for it. If we consider that elevators should only
care about maximizing usage of the underlying devices, then implementing
other non-hardware-dependent scheduling disciplines (that prioritize
according to the task or cgroup that generated the IO, for example) at
higher layers, so that we can reuse code, makes a lot of sense.
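
As a rough sketch of that split, the toy code below keeps the
proportional-weight policy in an upper layer and leaves the lower layer
(standing in for an elevator) to order bios by sector only. The
virtual-time scheme, the batch size and all function names are
assumptions made for illustration, not taken from any posted patch.

```python
def pick_batch(queues, weights, vtime, batch=4):
    """Upper level: release up to `batch` bios from the backlogged cgroup
    with the smallest virtual time, then charge it by 1/weight per bio."""
    backlogged = [g for g in queues if queues[g]]
    if not backlogged:
        return []
    group = min(backlogged, key=lambda g: (vtime[g], g))
    released = queues[group][:batch]
    queues[group] = queues[group][batch:]
    vtime[group] += len(released) / weights[group]
    return released

def elevator_order(bios):
    """Lower level: knows nothing about cgroups, just sorts by sector."""
    return sorted(bios, key=lambda bio: bio[1])

def run(queues, weights, rounds, batch=4):
    """Dispatch for a number of rounds; note that `queues` is consumed."""
    vtime = {g: 0.0 for g in queues}
    dispatched = []
    for _ in range(rounds):
        dispatched.extend(elevator_order(pick_batch(queues, weights, vtime, batch)))
    return dispatched

# Each bio is (cgroup, sector); A has twice the weight of B.
queues = {"A": [("A", s) for s in range(100)],
          "B": [("B", 1000 + s) for s in range(100)]}
out = run(queues, {"A": 2, "B": 1}, rounds=6)
print(sum(1 for g, _ in out if g == "A"), sum(1 for g, _ in out if g == "B"))
```

The point of the layering is that elevator_order() could be swapped for
any other ordering without touching the cgroup policy above it.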

Thanks,

Fernando