[RFC][PATCH -mm 1/5] i/o controller documentation

Thu Sep 18 09:26:51 PDT 2008

Vivek Goyal wrote:
> On Thu, Sep 18, 2008 at 05:03:59PM +0200, Andrea Righi wrote:
>> Vivek Goyal wrote:
>>> On Wed, Aug 27, 2008 at 06:07:33PM +0200, Andrea Righi wrote:
>>>> Documentation of the block device I/O controller: description, usage,
>>>> advantages and design.
>>>>
>>>> Signed-off-by: Andrea Righi <righi.andrea at gmail.com>
>>>> ---
>>>>  Documentation/controllers/io-throttle.txt |  377 +++++++++++++++++++++++++++++
>>>>  1 files changed, 377 insertions(+), 0 deletions(-)
>>>>  create mode 100644 Documentation/controllers/io-throttle.txt
>>>>
>>>> diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
>>>> new file mode 100644
>>>> index 0000000..09df0af
>>>> --- /dev/null
>>>> +++ b/Documentation/controllers/io-throttle.txt
>>>> @@ -0,0 +1,377 @@
>>>> +
>>>> +               Block device I/O bandwidth controller
>>>> +
>>>> +----------------------------------------------------------------------
>>>> +1. DESCRIPTION
>>>> +
>>>> +This controller limits the I/O bandwidth of specific block devices for
>>>> +specific process containers (cgroups) by imposing additional delays on I/O
>>>> +requests from processes that exceed the limits defined in the control
>>>> +group filesystem.
>>>> +
>>>> +Bandwidth limiting rules offer better control over QoS than priority- or
>>>> +weight-based solutions, which only give information about applications'
>>>> +relative performance requirements. Moreover, priority-based solutions are
>>>> +affected by performance bursts when only low-priority requests are submitted
>>>> +to a general purpose resource dispatcher.
>>>> +
>>>> +The goal of the I/O bandwidth controller is to improve performance
>>>> +predictability from the applications' point of view and provide performance
>>>> +isolation of different control groups sharing the same block devices.
>>>> +
>>>> +NOTE #1: If you're looking for a way to improve the overall throughput of the
>>>> +system, you should probably use a different solution.
>>>> +
>>>> +NOTE #2: The current implementation does not guarantee minimum bandwidth
>>>> +levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
>>>> +limits specified by the user. Minimum I/O rate thresholds are supposed to be
>>>> +guaranteed if the user configures a proper I/O bandwidth partitioning of the
>>>> +block devices shared among the different cgroups (theoretically if the sum of
>>>> +all the single limits defined for a block device doesn't exceed the total I/O
>>>> +bandwidth of that device).
>>>> +
>>> Hi Andrea,
>>>
>>> Had a query. What's your use case for capping max bandwidth? I was
>>> wondering whether proportional bandwidth would not cover it. So we allocate
>>> a weight/share to every cgroup and limit the bandwidth based on shares
>>> only in case of contention. Otherwise applications get unlimited
>>> bandwidth. This is much like what the cpu controller does, and dm-ioband
>>> seems to be doing the same thing. Will you not get the same kind of QoS here
>>> as with max-bandwidth? The only thing probably missing is what we call
>>> a hard limit: when BW is available but you don't want a user to use that
>>> BW unless the user has paid for it.
>> At the beginning my use case was to guarantee a certain level of
>> performance _predictability_. That means no more and no less than the
>> specified threshold (should I say this would be useful for the real-time
>> apps? maybe yes).
>>
> 
> Is "no more" harmful for a real-time env? Which RT application hates more
> bandwidth than what it asked for? I could understand "no-less" but you
> mentioned in the past that implementing minimum guarantees is a lot harder.

RT doesn't mean as fast as possible; the objective of RT is to meet
individual timing requirements. So the most important property for RT should
be predictability. If you know that an application requires exactly
T seconds to read a block from a device (no more, no less), then
you're not introducing uncertainty into your RT task.
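
To make the "no more" half concrete, here is a rough sketch of the delay a
throttler would impose so that a transfer never completes faster than the
configured rate. It is Python rather than kernel C only to keep it short, and
the function name and parameters are made up for illustration; this is not
the code from the patch set.

```python
def throttle_delay(bytes_done: int, elapsed_s: float, limit_bps: float) -> float:
    """Return the seconds to sleep so the average rate stays <= limit_bps.

    Hypothetical helper: names and units (bytes, seconds, bytes/second)
    are assumptions made for this sketch.
    """
    if limit_bps <= 0:
        raise ValueError("limit_bps must be positive")
    # Time the transfer should have taken at exactly the configured rate.
    min_time_s = bytes_done / limit_bps
    # Sleep only for the part not already spent; never return a negative delay.
    return max(0.0, min_time_s - elapsed_s)
```

For example, 4 MiB done in 1 second against a 1 MiB/s limit gives a 3 second
delay, so the transfer takes 4 seconds in total. Note this only enforces the
upper bound; it does nothing for the "no less" side.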

And I agree on the "no-less" part. It's difficult, but there's surely
room for improvement.
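
For now the only "no-less" handle is the one described in NOTE #2 of the doc:
keep the sum of the per-cgroup limits for a device within what the device can
deliver. A minimal sketch of that check follows (Python again, hypothetical
names). It assumes the device bandwidth can be treated as a single number,
which is exactly why the doc says "theoretically".

```python
from typing import Mapping


def partition_is_feasible(limits_bps: Mapping[str, float], device_bps: float) -> bool:
    """True if the per-cgroup limits for one device fit in its total bandwidth.

    limits_bps maps a cgroup name to its configured limit in bytes/second;
    device_bps is an assumed, fixed estimate of the device's total bandwidth.
    """
    return sum(limits_bps.values()) <= device_bps
```

With limits of 10 MB/s and 20 MB/s on a device estimated at 40 MB/s the
partitioning is feasible; raise the first limit to 30 MB/s and it is not.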

> I was thinking: what if we continue to stick to the current policy
> of letting RT requests go first and let them use disk bw first?
> cfq first dispatches requests of the RT class (based on their priority).
> So in a simple implementation, the IO controller will simply let all RT class
> requests go directly to the elevator and then let the elevator dispatch these
> requests based on their RT prio. The IO controller will only buffer and control
> requests of non-RT classes. This will make sure that we don't break
> existing working RT applications and can still divide the remaining disk
> BW among other non-RT tasks.
> 
> IMHO, once the above simple scheme is working, we can probably extend it to
> provide additional levels of control.
>  
> Thanks
> Vivek

Sounds reasonable, since we want stronger guarantees that the minimum bw
requirements of RT tasks are respected.
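
Just to check that I read the scheme correctly, a rough sketch of it: RT class
requests bypass the controller and go straight to the elevator, everything
else is buffered and released under the controller's policy. This is
illustrative Python, not kernel code; the class, the request layout and the
"budget" parameter are all made up, and only the value 1 for the RT class
mirrors the kernel's ioprio class numbering.

```python
from collections import deque

IOPRIO_CLASS_RT = 1  # real-time I/O class, as in the kernel's ioprio classes


class IoControllerSketch:
    """Hypothetical model: RT requests bypass, non-RT requests are buffered."""

    def __init__(self):
        self.elevator = []       # stand-in for the elevator's input queue
        self.buffered = deque()  # non-RT requests held back by the controller

    def submit(self, request):
        """request is a dict with an 'ioprio_class' key (assumed layout)."""
        if request["ioprio_class"] == IOPRIO_CLASS_RT:
            # RT class: hand over immediately, the elevator orders by RT prio.
            self.elevator.append(request)
        else:
            # Non-RT class: hold until the controller decides to release it.
            self.buffered.append(request)

    def release(self, budget):
        """Forward up to `budget` buffered requests; return how many moved."""
        released = 0
        while self.buffered and released < budget:
            self.elevator.append(self.buffered.popleft())
            released += 1
        return released
```

How `budget` is computed (max bandwidth, shares, or something else) is the
part left open for the later extensions.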

-Andrea