[PATCH -mm 0/6] cgroup: block device i/o controller (v10)

Andrea Righi righi.andrea at gmail.com
Wed Sep 17 04:05:22 PDT 2008

The objective of the i/o controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.

Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.

The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
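
As a rough illustration of the direct limiting idea, the sketch below computes
how long a task would have to sleep to stay under a bandwidth limit. It is a
userspace approximation, not code from the patchset: the helper name
throttle_sleep_ms() and its parameters are hypothetical, and the only detail
taken from this posting is that a limit of 0 means "unlimited".

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patchset): return how many
 * milliseconds a task should sleep so that `bytes` transferred during
 * `elapsed_ms` does not exceed `limit_bps` bytes per second.
 */
static uint64_t throttle_sleep_ms(uint64_t bytes, uint64_t limit_bps,
                                  uint64_t elapsed_ms)
{
    uint64_t min_ms;

    /* A limit of 0 is treated as "unlimited": never sleep. */
    if (limit_bps == 0)
        return 0;

    /* Minimum time the transfer may take at the configured rate. */
    min_ms = bytes * 1000 / limit_bps;

    /* Sleep only for the part of that time that has not yet elapsed. */
    return min_ms > elapsed_ms ? min_ms - elapsed_ms : 0;
}
```

The sleep is what trades throughput for predictability: a cgroup that pushes
i/o faster than its limit is delayed until its average rate falls back under
the limit.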

Detailed information about the design, its goals and usage is described in the

Tested against 2.6.27-rc5-mm1.

The all-in-one patch (and previous versions) can be found at:

Changelog: (v9 -> v10)
* fix a bug to correctly throttle small direct-IO writes
* fix: do not add a new limiting rule if the limit is 0 (unlimited)
* do not report time values directly in jiffies, always use clock_t
* remove a spinlock in struct iothrottle (we always hold cgroup_lock() when
  using it for RCU update, so an additional spinlock is not needed)
* use page_cgroup functionality provided by memory cgroup controller to charge
  the right cgroup of asynchronous i/o activity (e.g. pdflush writebacks)
* code simplification in cgroup_io_throttle()
* removed a lot of experimental stuff introduced in the previous version
* update documentation

TODO:
* Implement an rbtree per request queue; all the requests queued to the I/O
  subsystem first go into this rbtree. Then, based on cgroup grouping and
  control policy, dispatch the requests and pass them to the elevator
  associated with the queue. This would make it possible to provide both
  bandwidth limiting and proportional bandwidth functionality using a quite
  generic approach (suggested by Vivek Goyal)

* Improve fair throttling: distribute the time to sleep among all the tasks of
  a cgroup that exceeded the I/O limits, depending on the amount of I/O
  activity previously generated by each task (see task_io_accounting)

* Try to reduce the cost of calling cgroup_io_throttle() on every submit_bio();
  this is not too expensive, but the call to task_subsys_state() certainly has
  a cost. A possible solution could be to temporarily account I/O in the
  current task_struct and call cgroup_io_throttle() only every X MB of I/O,
  or every Y I/O requests. Ideally both X and Y could be tuned at runtime by
  a userspace tool

* Think about an alternative design for general purpose usage; special purpose
  usage right now is restricted to improving I/O performance predictability
  and evaluating more precise response timings for applications doing I/O. To
  a large degree the block I/O bandwidth controller should implement more
  complex logic to better evaluate the real cost of I/O operations, depending
  also on the particular block device profile (e.g. USB stick, optical drive,
  hard disk, etc.). This would also make it possible to appropriately account
  I/O cost for seeky workloads, compared to large stream workloads. Instead of
  looking at the request stream and trying to predict how expensive the I/O
  will be, a totally different approach could be to collect request timings
  (start time / elapsed time) and, based on the collected information, try to
  estimate the I/O cost and usage
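
The idea of accounting I/O locally and calling the throttling function only
every X MB or every Y requests could look roughly like the sketch below. It is
written as plain userspace C and is only an assumption about how such batching
might be structured: struct io_batch, struct io_batch_tunables and
io_batch_account() are hypothetical names, not code from the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task accumulator (would live in task_struct). */
struct io_batch {
    uint64_t bytes;     /* bytes accounted since the last flush */
    uint32_t requests;  /* requests accounted since the last flush */
};

/* Hypothetical runtime tunables: the X and Y thresholds. */
struct io_batch_tunables {
    uint64_t max_bytes;     /* X: flush after this many bytes */
    uint32_t max_requests;  /* Y: flush after this many requests */
};

/*
 * Account one request of `bytes` bytes. Returns true when either
 * threshold is reached; in that case *flush_bytes holds the total to
 * charge in a single throttling call and the accumulator is reset.
 */
static bool io_batch_account(struct io_batch *b,
                             const struct io_batch_tunables *t,
                             uint64_t bytes, uint64_t *flush_bytes)
{
    b->bytes += bytes;
    b->requests++;

    if (b->bytes < t->max_bytes && b->requests < t->max_requests)
        return false;

    *flush_bytes = b->bytes;
    b->bytes = 0;
    b->requests = 0;
    return true;
}
```

With this shape the expensive lookup is paid once per batch instead of once
per submit_bio(), at the price of coarser throttling: a task can overshoot its
limit by up to one batch before it is put to sleep.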

