dm-ioband + bio-cgroup benchmarks

Wed Sep 24 07:03:55 PDT 2008

On Wed, Sep 24, 2008 at 05:29:37PM +0900, Hirokazu Takahashi wrote:
> Hi,
> 
> > > > > > > I have got excellent results of dm-ioband, that controls the disk I/O
> > > > > > > bandwidth even when it accepts delayed write requests.
> > > > > > > 
> > > > > > > In this time, I ran some benchmarks with a high-end storage. The
> > > > > > > reason was to avoid a performance bottleneck due to mechanical factors
> > > > > > > such as seek time.
> > > > > > > 
> > > > > > > You can see the details of the benchmarks at:
> > > > > > > http://people.valinux.co.jp/~ryov/dm-ioband/hps/
> > > > > 
> > > > >   (snip)
> > > > > 
> > > > > > Secondly, why do we have to create an additional dm-ioband device for 
> > > > > > every device we want to control using rules. This looks little odd
> > > > > > atleast to me. Can't we keep it in line with rest of the controllers
> > > > > > where task grouping takes place using cgroup and rules are specified in
> > > > > > cgroup itself (The way Andrea Righi does for io-throttling patches)?
> > > > > 
> > > > > It isn't essential dm-band is implemented as one of the device-mappers.
> > > > > I've been also considering that this algorithm itself can be implemented
> > > > > in the block layer directly.
> > > > > 
> > > > > Although, the current implementation has merits. It is flexible.
> > > > >   - Dm-ioband can be place anywhere you like, which may be right before
> > > > >     the I/O schedulers or may be placed on top of LVM devices.
> > > > 
> > > > Hi,
> > > > 
> > > > An rb-tree per request queue also should be able to give us this
> > > > flexibility. Because logic is implemented per request queue, rules can be 
> > > > placed at any layer. Either at bottom most layer where requests are
> > > > passed to elevator or at higher layer where requests will be passed to 
> > > > lower level block devices in the stack. Just that we shall have to do
> > > > modifications to some of the higher level dm/md drivers to make use of
> > > > queuing cgroup requests and releasing cgroup requests to lower layers.
> > > 
> > > Request descriptors are allocated just right before passing I/O requests
> > > to the elevators. Even if you move the descriptor allocation point
> > > before calling the dm/md drivers, the drivers can't make use of them.
> > > 
> > 
> > You are right. request descriptors are currently allocated at bottom
> > most layer. Anyway, in the rb-tree, we put bio cgroups as logical elements
> > and every bio cgroup then contains the list of either bios or requeust
> > descriptors. So what kind of list bio-cgroup maintains can depend on
> > whether it is a higher layer driver (will maintain bios) or a lower layer
> > driver (will maintain list of request descriptors per bio-cgroup).
> 
> I'm getting confused about your idea.
> 
> I thought you wanted to make each cgroup have its own rb-tree,
> and wanted to make all the layers share the same rb-tree.
> If so, are you going to put different things into the same tree?
> Do you even want all the I/O schedlers use the same tree?
> 

Ok, I will give more details of the thought process.

I was thinking of maintaing an rb-tree per request queue and not an
rb-tree per cgroup. This tree can contain all the bios submitted to that
request queue through __make_request(). Every node in the tree will represent
one cgroup and will contain a list of bios issued from the tasks from that
cgroup.

Every bio entering the request queue through __make_request() function
first will be queued in one of the nodes in this rb-tree, depending on which
cgroup that bio belongs to.

Once the bios are buffered in rb-tree, we release these to underlying
elevator depending on the proportionate weight of the nodes/cgroups.

Some more details which I was trying to implement yesterday.

There will be one bio_cgroup object per cgroup. This object will contain
many bio_group objects. Each bio_group object will be created for each
request queue where a bio from bio_cgroup is queued. Essentially the idea
is that bios belonging to a cgroup can be on various request queues in the
system. So a single object can not serve the purpose as it can not be on
many rb-trees at the same time.  Hence create one sub object which will keep
track of bios belonging to one cgroup on a particular request queue.

Each bio_group will contain a list of bios and this bio_group object will
be a node in the rb-tree of request queue. For example. Lets say there are
two request queues in the system q1 and q2 (lets say they belong to /dev/sda
and /dev/sdb). Let say a task t1 in /cgroup/io/test1 is issueing io both
for /dev/sda and /dev/sdb.

bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
bios issued by task t1 for /dev/sdb. I thought the same can be extended
for stacked devices also.

I am still trying to implementing it and hopefully this is doable idea.
I think at the end of the day it will be something very close to dm-ioband
algorithm just that there will be no lvm driver and no notion of separate
dm-ioband device. 

> Are you going to block request descriptors in the tree?
> >From the view point of performance, all the request descriptors
> should be passed to the I/O schedulers, since the maximum number
> of request descriptors is limited. 
> 

In my initial implementation I was queuing the request descriptors. Then
you mentioned that it is not a good idea because potentially a cgroup
issuing more requests might win the race.

Yesterday night I thought, then why not start queuing the bios as they
are submitted to the request_queue, using __make_request() and then
release these to underlying elevator or underlying request queue (in case
of stacked device). This will remove few issues.

- All the layers can uniformly queue bios and no intermixing of queuing
 bios and request descriptors.

- Will get rid of issue of one cgroup winning the race because of limited
  number of request descriptors.

> And I still don't understand if you want to make your rb-tree
> work efficiently, you need to put a lot of bios or request descriptors
> into the tree. Is that what you are going to do?
> On the other hand, dm-ioband tries to minimize to have bios blocked.
> And I have a plan on reducing the maximum number that can be
> blocked there.
> 

Now I am planning to queue bios and probably there is no need to queue
request descriptors. I think that's what dm-ioband is doing. Queueing
bios for cgroups per io-band device.

Thinking more about it, In dm-ioband case, you seem to be buffering bios
from various cgroups on a separate request queue belonging to dm-ioband
device. I was thinking of moving all that buffering logic to existing
request queues instead of creating another request queue on top of request
queue I want to control (dm-ioband device).

> Sorry to bother you that I just don't understand the concept clearly.
> 
> > So basically mechanism of maintaining an rb-tree can be completely
> > ignorant of the fact whether a driver is keeping track of bios or keeping
> > track of requests per cgroup. 
> 
> I don't care whether the queue is implemented as a rb-tee or some
> kind of list because they are logically the same thing.

That's true. rb-tree or list is just data structure detail. It is not
important. The core thing I am trying to achive is that is there a way that
I can get rid of notion of creating a separate dm-ioband device for every
device I want to control.

Is it just me who finds creation of dm-ioband devices odd and difficult to
manage or there are other people who think that it would be nice if we can get
rid of it?

Thanks
Vivek