[patch 2/2] sched: fix nr_uninterruptible accounting of frozen tasks really

Matt Helsley matthltc at us.ibm.com
Fri Jul 17 08:22:35 PDT 2009

On Fri, Jul 17, 2009 at 02:31:50PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
> > plain text document attachment (freezer-fix-accounting-for-real.patch)
> > commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> > the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> > is excluded from accounting with a check for (task->flags &
> > PF_FROZEN), but that flag is cleared before the task is thawed. So
> > while we prevent the freezing task with state TASK_UNINTERRUPTIBLE
> > from being accounted to nr_uninterruptible, we still decrement
> > nr_uninterruptible on thaw.
> > 
> > Use a separate flag which is handled by the freezing task itself. Set
> > it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> > clear it after we return from frozen state.
> Right, so I'm wondering why we don't fully revert e3c8ca8336 to begin
> with.
> The changelog reads:
> ---
> commit e3c8ca8336707062f3f7cb1cd7e6b3c753baccdd
> Author: Nathan Lynch <ntl at pobox.com>
> Date:   Wed Apr 8 19:45:12 2009 -0500
>     sched: do not count frozen tasks toward load
>     Freezing tasks via the cgroup freezer causes the load average to climb
>     because the freezer's current implementation puts frozen tasks in
>     uninterruptible sleep (D state).
>     Some applications which perform job-scheduling functions consult the
>     load average when making decisions.  If a cgroup is frozen, the load
>     average does not provide a useful measure of the system's utilization
>     to such applications.  This is especially inconvenient if the job
>     scheduler employs the cgroup freezer as a mechanism for preempting low
>     priority jobs.  Contrast this with using SIGSTOP for the same purpose:
>     the stopped tasks do not count toward system load.
>     Change task_contributes_to_load() to return false if the task is
>     frozen.  This results in /proc/loadavg behavior that better meets
>     users' expectations.
> ---
> It appears to me that a frozen cgroup is a transient state. You
> would typically do something like:
>   freeze -> {snapshot, migrate} -> {thaw, destroy}
> Therefore a short increase in load doesn't seem like too big a problem,
> it's going to be gone soon anyway.
> Hmm?

The job scheduler in question does not use FROZEN as a transient state and
does not use checkpoint/restart at all since c/r is still a work in progress.
Even when used for power management it seems wrong to count frozen tasks
towards the loadavg since they aren't using CPU time or waiting for IO.

	-Matt Helsley
