No subject

Thu Feb 7 16:58:13 UTC 2013

deserves a bit more focus, and by "usable to userland", I don't mean
some group hacking up an elaborate, manual configuration which is
tailored to the point of being eccentric to suit the needs of said
group.  There's nothing wrong with that and they can continue to do
so, but it just isn't generically usable or useful.  It should be
possible to generically and automatically split resources among, say,
several servers and a couple of users sharing a system without
resorting to an indecipherable ad-hoc shell script running off
rc.local.

 Userland efforts
 ================

There are currently a few userland efforts trying to make interfacing
with cgroup less painful.

* libcg: Makes the cgroup interface accessible from programming
  languages, with support for configuration persistence, for which it
  brings its own config files to remember what to do on the next boot.
  Sans the persistence part, it just seems to directly translate the
  filesystem interface to a function interface.

  http://libcg.sourceforge.net/

* Workman: It's a rather young project but as its name (workload
  management) implies, its aims are higher level than those of libcg.
  It aims to provide high-level resource allocation and management and
  introduces new concepts like resource partitions to represent its
  view of resource hierarchy.  Like libcg, this one is implemented as
  a library but provides bindings for more languages.

  https://gitorious.org/workman/pages/Home

* Pax Controla Groupiana: A document on how not to step on others'
  toes while using cgroup.  It's not a software project but tries to
  define precautions that a program or user can take to avoid
  breaking or confusing other users of the cgroup filesystem.

  http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

All try to play nice with other possible users of the cgroup
filesystem - be it libvirt cgroup, applications doing their own cgroup
tricks, or hand-crafted custom scripts.  While the approach is
understandable given that those usages already exist, I don't think
it's a workable solution in the long term.  There are several reasons
for that.

* The configurations aren't independent.  e.g. for weight-based
  controllers, your weight is only meaningful in relation to other
  weights at that level.  Distributing configuration among whatever
  entities happen to write to cgroupfs simply cannot work.  It's
  fundamentally flawed.

* It's fragile like hell.  There's no accountability.  Nobody really
  knows what's going on.  Is this subdirectory still there due to a
  bug in this program, or something or someone else created it and
  crashed / forgot to remove it, or what?  Oh, the cgroup I wanted to
  create already exists.  Maybe the previous instance created it and
  then crashed or maybe some other program just happened to choose the
  same name.  Who owns config knobs in that directory?  This way lies
  madness.  I understand why the Pax doc exists but I'm not sure its
  long-term effect would be positive - best practices which ultimately
  lead to utter confusion and fragility.

* In many cases, resource distribution is a system-wide policy decision
  and determining what to do often requires system-wide knowledge.
  You can't provision memory limits without knowing what's available
  in the system and what else is going on in the system, and you want
  to be able to adjust them as the situation and configuration change.
  Without anybody having a full picture of how resources are
  provisioned, how would any of that be possible?
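
To make the first point above concrete, here's a minimal sketch of why
a weight means nothing on its own.  It assumes the usual proportional
model where each sibling's share under contention is its weight divided
by the sum of all sibling weights.  The group names and numbers are made
up for illustration and nothing here touches the real cgroup filesystem.

```python
def effective_shares(weights):
    """Return each sibling's fraction of a contended resource.

    weights maps a (hypothetical) cgroup name to its weight.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# One writer sets up two siblings expecting a 3:1 split.
siblings = {"db": 300, "web": 100}
before = effective_shares(siblings)

# An unrelated writer adds a sibling at the same level without coordinating.
siblings["batch"] = 600
after = effective_shares(siblings)

print(before["db"], after["db"])
```

Nobody changed the weight of "db", yet its share under contention drops
from three quarters to less than a third.  Whoever picked 300 had no way
of knowing what that number would end up meaning.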

I think this anything-goes approach is prevalent largely because the
cgroup filesystem interface encourages such usage.  From the looks of
it, the filesystem permissions combined with hierarchy should be able
to handle delegation perfectly.  Well, as it currently stands, it's
anything but and the interface is just misleading.  Hierarchy support
is an utter mess, configuration schemes aren't uniform across
controllers, and, more fundamentally, hierarchy itself is expensive -
we can't delegate hierarchy creation to unprivileged users or
programs safely.

It is in the realm of possibility to make all cgroup operations and
controllers do all that; however, it's a very tall order.  Just
think about how much effort it has taken to achieve and maintain
proper delegation in the core elements of the kernel - processes and
filesystems.  There will be security implications with cgroup,
likely involving a lot of gotchas and extensions of security
infrastructure, and, even then, I'm pretty sure it's gonna require
help from userland to effect proper policy decisions and config
changes.  We have things like polkit for a reason and are likely to
need finer-grained, domain-aware access control than is possible by
tweaking directory permissions.

Given the above and how relatively marginal cgroup is, I'm extremely
skeptical that implementing full delegation in kernel is the right
course of action and am likely to scream like a banshee at any attempt
to drive things that way.

I think the only logical thing to do is to create a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure, represents available resources
in a sane form, and makes policy decisions based on configuration and
requests.  I don't have a concrete idea what that authority should be
like, but I think there already are pretty similar facilities in our
userland, and don't see why this should be much different.
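
Purely as an illustration of the kind of bookkeeping such an authority
could do, and not a proposal for what it should look like, here's a toy
sketch.  Everything in it is hypothetical: the ResourceAuthority class,
its request/release methods, the owner names and the memory figures.  It
keeps state in memory only and never writes to cgroupfs.  The point is
just that a single party which knows the total and who owns what can
refuse name collisions and over-provisioning, which independent writers
cannot.

```python
class ResourceAuthority:
    """Toy central arbiter for memory limits (hypothetical, in-memory only)."""

    def __init__(self, total_memory):
        self.total_memory = total_memory
        self.groups = {}  # group name -> (owner, memory limit)

    def allocated(self):
        """Sum of all limits granted so far."""
        return sum(limit for _, limit in self.groups.values())

    def request(self, owner, name, limit):
        """Grant or refuse a memory limit; returns (granted, reason)."""
        current = self.groups.get(name)
        if current is not None and current[0] != owner:
            return False, "name owned by " + current[0]
        # When an owner resizes its own group, don't count the old limit twice.
        previous = current[1] if current is not None else 0
        if self.allocated() - previous + limit > self.total_memory:
            return False, "not enough memory left"
        self.groups[name] = (owner, limit)
        return True, "ok"

    def release(self, owner, name):
        """Remove a group; only its recorded owner may do so."""
        current = self.groups.get(name)
        if current is None or current[0] != owner:
            return False
        del self.groups[name]
        return True


authority = ResourceAuthority(total_memory=8192)  # MiB, made-up figure
r1 = authority.request("libvirt", "vm1", 4096)
r2 = authority.request("backup-script", "vm1", 1024)     # name collision
r3 = authority.request("backup-script", "backup", 6144)  # over-provisioning
r4 = authority.request("backup-script", "backup", 2048)
```

Here r1 and r4 are granted while r2 and r3 are refused, each with a
reason the requester can act on.  A real authority would of course need
actual policy, persistence and access control on top of this.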

Another reason why this could be helpful is that we're gonna be
morphing towards unified hierarchy and it'd be very nice to have
something which can match impedance between the old and new ways and
not require each individual consumer of cgroup to handle such changes.
As for the unified hierarchy, we just have to.  cgroup is currently
fundamentally broken in that it's impossible to tell which cgroup a
resource belongs to independent of which task is looking at it.  It's
like this damn thing is designed to honor Heisenberg and Einstein.  No
disrespect for the great minds, but it just doesn't look like the
proper place.

Even apart from the unified hierarchy thing, I think it generally is a
good idea to have a buffer layer between the kernel interface and
individual consumers for cgroup, which is still very immature and
kinda tightly coupled with internal implementation details.

So, umm, that's what I want.  When I first heard of Workman, I was
excited thinking maybe the universe is being really nice and making
things happen according to my wishes without me actually doing
anything. :) Oh well, one can dream, but everything is still early, so
hopefully we have enough time to figure things out.

What do you guys think?

Thanks.

--
tejun