[PATCH 6/6] Makes procs file writable to move all threads by tgid at once

Peter Zijlstra a.p.zijlstra at chello.nl
Thu Aug 6 04:02:20 PDT 2009

On Thu, 2009-08-06 at 03:42 -0700, Paul Menage wrote:
> On Thu, Aug 6, 2009 at 3:34 AM, Peter Zijlstra<a.p.zijlstra at chello.nl> wrote:
> > On Thu, 2009-08-06 at 03:28 -0700, Paul Menage wrote:
> >
> >> OK, well if lockdep can't currently handle the "writer takes a lock on
> >> every thread" model,
> >
> > I haven't read what this is about, but simply looking at that sentence
> > makes me want to hit someone with a cluebat. Have you any idea how
> > expensive that is?
> For lockdep to track that many locks, or just the concept of taking
> that many locks generally?

Taking that many locks in general, some apps (JVM based usually) tend to
be thread heavy and can easily have hundreds of them, even on relatively
small (~8 CPU) machines. Having to take hundreds of locks on such a
small machine seems to indicate to me you're doing something wrong.

[ I now understand its only 'all threads of a process' which is
  slightly better than 'all threads' but still crazy ]

> The basic idea is that in order to implement a "procs" file in cgroups
> that can migrate all threads in a process atomically, we need to
> synchronize with concurrent clone() calls. But since thread clones are
> likely to occur far more often than "procs" writes, and we wanted to
> avoid introducing overhead into the clone path, one approach was to
> give each thread a fork mutex, which it could take around the relevant
> parts of the fork/clone operation, and have the "procs" writer deal
> with obtaining the fork mutex for every thread in the process being
> moved, so pushing the overhead on to the "procs" writer.

See below.

> I don't think it's a deficiency of lockdep that it would have trouble
> dealing with this - in fact, my original plan was that we'd just have
> to live with the fact that anyone doing a "procs" move on a massive
> process would have to live with lockdep printing an overflow warning.

Now that's not real nice is it ;-)

> But given that AFAICS we can eliminate the overhead associated with a
> single lock by piggy-backing on the cache line containing
> sighand->count, hopefully this won't be an issue any more.

Right, so this is a write rarely, read frequently thing, which suggests
an RCU like approach where the readers pay a minimum synchronization

Then again, clone() might already serialize on the process as a whole
(not sure though, Oleg/Ingo?), in which case you can indeed take a
process wide lock.

If not, you might think about using an SRCU variant where the read
(clone) side does something like:

 if (rcu_derference(tsk->sighand->cgroup_block))
   wait_event(&some_queue, !tsk->sighand->cgroup_block);


 /* do the clone bit */


and the write (cgroup thingy) side does:

 rcu_assign(tsk->sighand->cgroup_block, 1);


 /* everybody will be out of clone */

 /* do your thing */

 rcu_assign(tsk->sighand->cgroup_block, 0);

Or something like that, which should reduce the read-side overhead to a
single shared read and some SRCU ops which should be per-cpu in the
normal case.

More information about the Containers mailing list