[PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Paul Menage menage at google.com
Wed Jul 15 09:18:33 PDT 2009

On Wed, Jul 15, 2009 at 1:33 AM, Eric W. Biederman<ebiederm at xmission.com> wrote:
> I think guaranteeing a truly atomic snapshot is likely to be a
> horrible idea requiring all kinds of nasty locking,

We don't guarantee a truly atomic snapshot unless you manage to read
the entire file with a single read() call.

> and smp
> scalability issues.  So please walk the list of pids and
> just return those that belong to your cgroup.

The downside with that is that scanning any cgroup takes O(n) in the
number of threads on the machine, so scanning them all becomes O(n^2).
We've definitely seen problems (on older kernels using cpusets, which
did something similar, i.e. walking the tasklist) where we have lots
of small cpusets and a few huge ones, and this blew up the cost of
accessing any of them.

But having said that, the idea of being able to maintain just a cursor
is something that would definitely be nice.

Here's another idea that might work:

Currently, each cgroup has a list running through the attached css_set
objects, and each css_set has a list running through its tasks; we
iterate through these lists of lists to produce the cgroup's task list

Since this list isn't sorted in any way, there's no convenient way to
save/restore your position between seq_file invocations; this is why
we currently generate a sorted snapshot, so that even if the snapshot
is updated by someone else before our next read, we know where to pick
up from (the next pid above the last one that we returned).

Instead, we could actually store cursor objects in the list itself
whenever we need to pause/resume iterating through a large cgroup (due
to hitting the limits of a single seq_file read, i.e. probably after
every 700 threads).

Then we'd just need to teach cgroup_iter_next() to distinguish between
real tasks and cursors, and skip the latter. Simple way to do that
would be to change the existing declarations in task_struct:

        /* Control Group info protected by css_set_lock */
        struct css_set *cgroups;
        /* cg_list protected by css_set_lock and tsk->alloc_lock */
        struct list_head cg_list;

and instead define these two fields together as a struct
cgroup_css_list_elem. A cursor can just be a cgroup_css_list_elem
whose cgroups field points to a distinguishing address that identifies
it as a cursor.

So we're guaranteed to hit all threads that are in the cgroup before
we start, and stay there until we finish; there are no guarantees
about threads that move into or out of the cgroup while we're

It's a bit more complex than just iterating over the machine's entire
pid list, but I think it scales better.


More information about the Containers mailing list