[RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)

Wed Aug 27 09:07:32 PDT 2008

The objective of the i/o controller is to improve the predictability of i/o
performance for different cgroups that share the same block devices.

Unlike priority/weight-based solutions, this controller explicitly chokes
applications' requests that directly (or indirectly) generate i/o activity in
the system.

The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
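
As a rough illustration of what limiting bandwidth by sleeping means, here is a
minimal userspace sketch. It is not code from the patchset: the function name,
the millisecond units and the "0 means unlimited" convention are assumptions
made only for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the bytes a cgroup submitted during the last
 * elapsed_ms milliseconds and its limit in bytes/second, return how long
 * the submitting task should sleep so that the average rate does not
 * exceed the limit. Assumes bytes * 1000 fits in 64 bits.
 */
uint64_t throttle_sleep_ms(uint64_t bytes, uint64_t elapsed_ms,
			   uint64_t limit_bps)
{
	uint64_t needed_ms;

	if (limit_bps == 0)	/* assumption: 0 means "no limit" */
		return 0;

	/* Time the transfer should have taken at exactly the limit. */
	needed_ms = bytes * 1000 / limit_bps;

	/* Sleep only for the part of that time that has not yet passed. */
	return needed_ms > elapsed_ms ? needed_ms - elapsed_ms : 0;
}
```

The sleep is where the throughput cost comes from: the device may be idle while
the task waits, but the rate seen by the other cgroups stays bounded.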

Detailed information about the design, its goals and usage is given in the
documentation.

Patchset against 2.6.27-rc1-mm1.

The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

This patchset is an experimental implementation: it includes functional
differences with respect to the previous versions (see the changelog below),
and I haven't done much testing yet. So, comments are really welcome.

Changelog: (v8 -> v9)

* introduce struct res_counter_ratelimit as a generic structure to implement
  throttling-based cgroup subsystems
* remove the throttling hooks from the page cache (set_page_dirty) and set a
  single throttling hook in submit_bio() for both read and write operations; a
  generic process that is dirtying pages on a limited block device (for the
  cgroup it belongs to) is forced to flush the same amount of pages back to the
  block device (in this way write operations are forced to occur in the same IO
  context as the process that actually generated the IO)
* collect per cgroup, block device and task throttling statistics (throttle
  counter and total time slept for throttling) and export them to userspace
  through blockio.throttlcnt (in the cgroup filesystem) and
  /proc/PID/io-throttle-stat (per-task statistics)
* fair throttling: simple attempt to distribute the sleeps equally among all
  the tasks belonging to the same cgroup; instead of imposing a sleep on the
  first task that exceeds the IO limits, the time to sleep is divided by the
  number of tasks present in the same cgroup
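
The fair throttling item above boils down to a single division. A minimal
sketch follows, assuming a hypothetical helper name and an abstract time unit;
it is not the code in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: total_sleep is the sleep that would bring the
 * cgroup back under its limit if imposed on one task; nr_tasks is the
 * number of tasks currently in the cgroup. The task that hit the limit
 * sleeps only for its equal share.
 */
uint64_t fair_sleep(uint64_t total_sleep, unsigned int nr_tasks)
{
	if (nr_tasks == 0)	/* defensive: avoid dividing by zero */
		nr_tasks = 1;

	return total_sleep / nr_tasks;
}
```

For example, with a total sleep of 100 ticks and 4 tasks in the cgroup, the
task that exceeded the limit sleeps 25 ticks instead of 100.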

TODO:

* Try to push down the throttling and implement it directly in the I/O
  schedulers, using bio-cgroup (http://people.valinux.co.jp/~ryov/bio-cgroup/)
  to keep track of the right cgroup context. This approach could lead to more
  memory consumption and increase the number of dirty pages (pages that are
  hard/slow to reclaim) in the system, since the dirty-page ratio in memory is
  not limited. This could even lead to potential OOM conditions, but these
  problems can be resolved directly in the memory cgroup subsystem

* Handle I/O generated by kswapd: at the moment there's no control over the I/O
  generated by kswapd; try to use the page_cgroup functionality of the memory
  cgroup controller to track this kind of I/O and charge the right cgroup when
  pages are swapped in/out

* Improve fair throttling: distribute the time to sleep among all the tasks of
  a cgroup that exceeded the I/O limits, depending on the amount of IO activity
  generated in the past by each task (see task_io_accounting)

* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
  this is not too expensive, but the call to task_subsys_state() surely has a
  cost. A possible solution could be to temporarily account I/O in the current
  task_struct and call cgroup_io_throttle() only every X MB of I/O, or every Y
  I/O requests. Better if X and/or Y can be tuned at runtime by a userspace
  tool

* Think about an alternative design for general purpose usage; special purpose
  usage right now is restricted to improving I/O performance predictability
  and evaluating more precise response timings for applications doing I/O. To
  a large degree the block I/O bandwidth controller should implement a more
  complex logic to better evaluate the real cost of I/O operations, depending
  also on the particular block device profile (e.g. USB stick, optical drive,
  hard disk, etc.). This would also make it possible to appropriately account
  I/O cost for seeky workloads, as opposed to large stream workloads. Instead
  of looking at the request stream and trying to predict how expensive the I/O
  will be, a totally different approach could be to collect request timings
  (start time / elapsed time) and, based on the collected information, try to
  estimate the I/O cost and usage
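
For the TODO item about reducing the cost of calling cgroup_io_throttle() on
every submit_bio(), the batching idea could look roughly like the userspace
sketch below. The struct, its fields and io_batch_account() are hypothetical
names introduced for illustration; in the kernel the counters would live in
task_struct and the thresholds would be the runtime-tunable X and Y.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task batching state (not from the patchset). */
struct io_batch {
	uint64_t pending_bytes;	/* accounted locally, not yet charged */
	uint64_t pending_reqs;	/* requests accounted locally */
	uint64_t max_bytes;	/* "X": byte threshold, 0 disables it */
	uint64_t max_reqs;	/* "Y": request threshold, 0 disables it */
};

/*
 * Account one request locally. Returns true when the accumulated I/O
 * should be charged to the cgroup; *charge_bytes then holds the amount
 * to pass to the expensive throttling call and the local counters are
 * reset. With both thresholds disabled every request is charged at once.
 */
bool io_batch_account(struct io_batch *b, uint64_t bytes,
		      uint64_t *charge_bytes)
{
	bool no_batching = !b->max_bytes && !b->max_reqs;

	b->pending_bytes += bytes;
	b->pending_reqs++;

	if (no_batching ||
	    (b->max_bytes && b->pending_bytes >= b->max_bytes) ||
	    (b->max_reqs && b->pending_reqs >= b->max_reqs)) {
		*charge_bytes = b->pending_bytes;
		b->pending_bytes = 0;
		b->pending_reqs = 0;
		return true;
	}

	return false;
}
```

The caller would invoke the throttling function only when this returns true,
so the per-bio cost is reduced to two additions and a few comparisons. The
trade-off is that a task can overshoot its limit by up to one batch before it
is throttled.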

-Andrea