dm-ioband + bio-cgroup benchmarks

Fri Sep 26 09:11:25 PDT 2008

Vivek Goyal wrote:
[snip]
> Ok, I will give more details of the thought process.
> 
> I was thinking of maintaining an rb-tree per request queue and not an
> rb-tree per cgroup. This tree can contain all the bios submitted to that
> request queue through __make_request(). Every node in the tree will represent
> one cgroup and will contain a list of bios issued from the tasks from that
> cgroup.
> 
> Every bio entering the request queue through __make_request() function
> first will be queued in one of the nodes in this rb-tree, depending on which
> cgroup that bio belongs to.
> 
> Once the bios are buffered in rb-tree, we release these to underlying
> elevator depending on the proportionate weight of the nodes/cgroups.
> 
> Some more details which I was trying to implement yesterday.
> 
> There will be one bio_cgroup object per cgroup. This object will contain
> many bio_group objects. Each bio_group object will be created for each
> request queue where a bio from bio_cgroup is queued. Essentially the idea
> is that bios belonging to a cgroup can be on various request queues in the
> system. So a single object can not serve the purpose as it can not be on
> many rb-trees at the same time.  Hence create one sub object which will keep
> track of bios belonging to one cgroup on a particular request queue.
> 
> Each bio_group will contain a list of bios and this bio_group object will
> be a node in the rb-tree of the request queue. For example, let's say there
> are two request queues in the system, q1 and q2 (let's say they belong to
> /dev/sda and /dev/sdb), and a task t1 in /cgroup/io/test1 is issuing IO to
> both /dev/sda and /dev/sdb.
> 
> bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
> bios issued by task t1 for /dev/sdb. I thought the same can be extended
> for stacked devices also.
>   
> I am still trying to implement it, and hopefully this is a doable idea.
> I think at the end of the day it will be something very close to the
> dm-ioband algorithm, just that there will be no lvm driver and no notion
> of a separate
> dm-ioband device. 
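
To make the layout above concrete, here is a minimal user-space sketch in
Python, not kernel code. Several things in it are assumptions made only for
illustration: a dict stands in for the per-queue rb-tree, the integer
`weight` field and the "release up to `weight` bios per round" policy are
hypothetical (the proposal only says bios are released according to
proportionate weight), and the second cgroup, /cgroup/io/test2, is invented
for the example.

```python
from collections import deque


class BioGroup:
    """Bios of one cgroup pending on one request queue (a tree node)."""

    def __init__(self, cgroup):
        self.cgroup = cgroup
        self.bios = deque()


class BioCgroup:
    """One per cgroup; owns one BioGroup per request queue it has IO on."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight   # hypothetical integer weight
        self.groups = {}       # RequestQueue -> BioGroup


class RequestQueue:
    def __init__(self, device):
        self.device = device
        self.tree = {}         # cgroup name -> BioGroup (stand-in for the rb-tree)

    def make_request(self, cgroup, bio):
        """Buffer the bio in its cgroup's node, creating the node on first use."""
        group = self.tree.get(cgroup.name)
        if group is None:
            group = BioGroup(cgroup)
            self.tree[cgroup.name] = group
            cgroup.groups[self] = group
        group.bios.append(bio)

    def dispatch_round(self):
        """Release up to `weight` bios per cgroup, in cgroup-name order."""
        released = []
        for name in sorted(self.tree):
            group = self.tree[name]
            for _ in range(min(group.cgroup.weight, len(group.bios))):
                released.append(group.bios.popleft())
        return released


# Task t1 in /cgroup/io/test1 issues IO to both devices.
q1, q2 = RequestQueue("/dev/sda"), RequestQueue("/dev/sdb")
test1 = BioCgroup("/cgroup/io/test1", weight=2)
test2 = BioCgroup("/cgroup/io/test2", weight=1)   # hypothetical second cgroup
for i in range(4):
    q1.make_request(test1, f"t1-sda-{i}")
    q1.make_request(test2, f"t2-sda-{i}")
q2.make_request(test1, "t1-sdb-0")

first_round = q1.dispatch_round()
```

Here test1 ends up with two bio_group objects, one in q1's tree and one in
q2's, and one dispatch round on q1 hands the elevator two bios from test1
for each one from test2.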

Vivek, thanks for the detailed explanation. Just one comment: if we
don't also change the per-process optimizations/improvements made by
some IO schedulers, I think we can get undesirable behaviours.

For example: CFQ uses the per-process iocontext to improve fairness
between *all* the processes in a system. But it doesn't have the concept
that there's a cgroup context on-top-of the processes.

So, some optimizations made to guarantee fairness among processes could
conflict with algorithms implemented at the cgroup layer, and
potentially lead to undesirable behaviours.

For example, an issue I'm experiencing with my cgroup-io-throttle
patchset is that a cgroup can consistently increase its IO rate (always
respecting the max limits) simply by running more IO worker tasks than
another cgroup. This is probably due to the fact that CFQ tries to give
the same amount of "IO time" to all the tasks, without considering that
they're organized in cgroups.

I don't see this behaviour with noop or deadline, because they don't
have the concept of iocontext.
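
The worker-count effect can be illustrated with a rough back-of-the-envelope
model in Python. It assumes an idealized scheduler that splits IO time
exactly equally among tasks and ignores the max limits and everything else
a real CFQ does; the function name and the cgroup names "A" and "B" are
hypothetical.

```python
from fractions import Fraction


def cgroup_shares(workers_per_cgroup):
    """Share of IO time each cgroup gets when every task gets an equal slice."""
    total = sum(workers_per_cgroup.values())
    return {cg: Fraction(n, total) for cg, n in workers_per_cgroup.items()}


# Cgroup "A" runs 4 IO workers, cgroup "B" runs 1.
shares = cgroup_shares({"A": 4, "B": 1})
```

Under this assumption a cgroup's share is simply its fraction of the IO
tasks in the system, so adding workers raises the share regardless of any
weight assigned at the cgroup layer.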

-Andrea