[RFC] writeback and cgroup

Sat Apr 14 14:36:39 UTC 2012

On Thu, Apr 12, 2012 at 01:51:48PM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, Apr 12, 2012 at 04:37:19PM -0400, Vivek Goyal wrote:
> > I mean how are we supposed to put cgroup throttling rules using cgroup
> > interface for network filesystems and for btrfs global bdi. Using "dev_t"
> > associated with bdi? I see that all the bdi's are showing up in
> > /sys/class/bdi, but how do I know which one I am interested in or which
> > one belongs to filesystem I am interested in putting throttling rule on.
> > 
> > For block devices, we simply use "major:min limit" format to write to
> > a cgroup file and this configuration will sit in one of the per queue
> > per cgroup data structure.
> > 
> > I am assuming that when you say throttling should happen at bdi, you
> > are thinking of maintaining per cgroup per bdi data structures and user
> > is somehow supposed to pass "bdi_maj:bdi_min  limit" through cgroup files?
> > If yes, how does one map a filesystem's bdi we want to put rules on?
> 
> I think you're worrying way too much.  One of the biggest reasons we
> have layers and abstractions is to avoid worrying about everything
> from everywhere.  Let block device implement per-device limits.  Let
> writeback work from the backpressure it gets from the relevant IO
> channel, bdi-cgroup combination in this case.
> 
> For stacked or combined devices, let the combining layer deal with
> piping the congestion information.  If it's per-file split, the
> combined bdi can simply forward information from the matching
> underlying device.  If the file is striped / duplicated somehow, the
> *only* layer which knows what to do is and should be the layer
> performing the striping and duplication.  There's no need to worry
> about it from blkcg and if you get the layering correct it isn't
> difficult to slice such logic in between.  In fact, most of it
> (backpressure propagation) would just happen as part of the usual
> buffering between layers.

Yeah the backpressure idea would work nicely with all possible
intermediate stacking between the bdi and leaf devices. In my attempt
to do combined IO bandwidth control for

- buffered writes, in balance_dirty_pages()
- direct IO, in the cfq IO scheduler

I have had to dig into the cfq code over the past few days to get an
idea of how the two throttling layers can cooperate (and to suffer the
pains that arise from layering violations). It's also rather tricky to
get two previously independent throttling mechanisms to work
seamlessly with each other and provide the desired _unified_ user
interface. It took a lot of reasoning and experiments to work the
basic scheme out...

But here is the first result. The attached graph shows the progress of
4 tasks:
- cgroup A: 1 direct dd + 1 buffered dd
- cgroup B: 1 direct dd + 1 buffered dd

The 4 tasks mostly progress at the same pace. The top 2 smoother lines
are the buffered dirtiers; the bottom 2 lines are the direct writers.
As you may notice, the two direct writers stall once or twice, which
widens the gaps between the lines. Otherwise, the algorithm works as
expected in distributing the bandwidth to each task.

The current code targets the more realistic user demand of
distributing bandwidth equally to each cgroup and, inside each cgroup,
equally between buffered and direct writes. On top of that, weights
can be specified to change the default distribution.

The implementation involves adding "weight for direct IO" to the cfq
groups and "weight for buffered writes" to the root cgroup. Note that
the current cfq proportional IO controller does not offer explicit
control over the direct:buffered ratio.

When a cgroup has both direct and buffered writers,
balance_dirty_pages() will kick in and adjust the weights for cfq to
apply. Note that cfq will continue to send all flusher IOs to the
root cgroup.  balance_dirty_pages() will compute the overall async
weight for the root cgroup, so that in the above test case the
computed weights will be

- 1000 async weight for the root cgroup (2 buffered dds)
- 500 dio weight for cgroup A (1 direct dd)
- 500 dio weight for cgroup B (1 direct dd)
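
The arithmetic behind these numbers can be sketched as follows. This
is only an illustration, not the patch itself: the Cgroup class and
split_weights() are hypothetical names, and it assumes that each
cgroup in the test has weight 1000 and that a cgroup's weight is split
equally between the write classes (buffered, direct) it contains, with
all buffered shares summed into the root cgroup's async weight.

```python
# Sketch only: names and the equal-split rule are assumptions made for
# illustration; this is not the code from the patch.
from dataclasses import dataclass


@dataclass
class Cgroup:
    name: str
    weight: int         # cgroup weight (assumed to be 1000 in the first test)
    has_buffered: bool  # cgroup contains buffered writers
    has_direct: bool    # cgroup contains direct IO writers


def split_weights(cgroups):
    """Return (root async weight, {cgroup name: dio weight})."""
    root_async = 0
    dio = {}
    for cg in cgroups:
        classes = int(cg.has_buffered) + int(cg.has_direct)
        if classes == 0:
            continue  # idle cgroup, nothing to distribute
        share = cg.weight // classes  # equal split between the classes present
        if cg.has_buffered:
            root_async += share       # flusher IO is charged to the root cgroup
        if cg.has_direct:
            dio[cg.name] = share      # direct IO stays in the cgroup itself
    return root_async, dio


first_test = [
    Cgroup("A", 1000, has_buffered=True, has_direct=True),
    Cgroup("B", 1000, has_buffered=True, has_direct=True),
]
print(split_weights(first_test))  # (1000, {'A': 500, 'B': 500})
```

Only the ratios matter here; the absolute weight of 1000 per cgroup is
chosen so that the output matches the figures listed above.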

The second graph shows the result for another test case:
- cgroup A, weight 300: 1 buffered cp
- cgroup B, weight 600: 1 buffered dd + 1 direct dd
- cgroup C, weight 300: 1 direct dd
which is also working as expected.

Once cfq properly grants the total async IO share to the flusher,
balance_dirty_pages() will then do its original job of distributing
the buffered write bandwidth among the buffered dd tasks.
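
To make the two-level split concrete, here is a rough sketch of the
per-task bandwidth one would expect in the first test case. It is an
illustration under assumptions, not the real algorithm:
task_bandwidth() is a hypothetical helper, the 100 MB/s disk bandwidth
is a made-up figure, and buffered bandwidth is assumed to be divided
equally among the buffered tasks.

```python
# Sketch only: a hypothetical helper showing how the cfq weights and the
# balance_dirty_pages() split would combine into per-task bandwidth.
def task_bandwidth(total_bw, async_weight, dio_weights,
                   buffered_tasks, direct_tasks):
    """Return {task name: expected bandwidth}.

    total_bw       -- disk bandwidth to distribute (any unit)
    async_weight   -- async weight of the root cgroup (flusher IO)
    dio_weights    -- {cgroup name: dio weight}
    buffered_tasks -- names of all buffered writers
    direct_tasks   -- {cgroup name: [names of its direct writers]}
    """
    total_weight = async_weight + sum(dio_weights.values())
    bw = {}

    # Level 1 (cfq): the flusher gets the async share of the disk.
    flusher_bw = total_bw * async_weight / total_weight
    # Level 2 (balance_dirty_pages): split it among the buffered tasks.
    for task in buffered_tasks:
        bw[task] = flusher_bw / len(buffered_tasks)

    # Direct writers are served by cfq from their own cgroup's dio weight.
    for cg, tasks in direct_tasks.items():
        cg_bw = total_bw * dio_weights[cg] / total_weight
        for task in tasks:
            bw[task] = cg_bw / len(tasks)
    return bw


result = task_bandwidth(
    total_bw=100.0,  # hypothetical disk bandwidth in MB/s
    async_weight=1000,
    dio_weights={"A": 500, "B": 500},
    buffered_tasks=["A/buffered dd", "B/buffered dd"],
    direct_tasks={"A": ["A/direct dd"], "B": ["B/direct dd"]},
)
for task, mbps in sorted(result.items()):
    print(task, mbps)  # every task ends up with 25.0
```

With these inputs all 4 tasks get the same share, which is consistent
with the lines in the first graph progressing at roughly the same pace.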

It will have to assume that the devices under the same bdi are
"symmetric". It also needs further stats feedback on IOPS or disk time
in order to do IOPS/time based IO distribution. Anyway, it would be
interesting to see how far this scheme can go. I'll clean up the code
and post it soon.

Thanks,
Fengguang
-------------- next part --------------
A non-text attachment was scrubbed...
Name: balance_dirty_pages-task-bw.png
Type: image/png
Size: 72619 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/containers/attachments/20120414/242400bf/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: balance_dirty_pages-task-bw.png
Type: image/png
Size: 69646 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/containers/attachments/20120414/242400bf/attachment-0003.png>