memcg creates an unkillable task in 3.11-rc2

Thu Sep 5 09:56:53 UTC 2013

It seems that this one fell through the cracks?

On Thu 01-08-13 11:06:20, Michal Hocko wrote:
> On Wed 31-07-13 15:09:16, Eric W. Biederman wrote:
> > Michal Hocko <mhocko at suse.cz> writes:
> > 
> > > [I am CCing David here as well]
> > >
> > > On Tue 30-07-13 09:37:46, Eric W. Biederman wrote:
> > >> Michal Hocko <mhocko at suse.cz> writes:
> > >> 
> > >> > On Tue 30-07-13 01:19:31, Eric W. Biederman wrote:
> > >> > [...]
> > >> >> Hmm. Looking farther I see what is going on. And it has nothing to do
> > >> >> with the freezer. (I have commented out that code and reproduced it
> > >> >> without the freezer to be doubly certain).
> > >> >> 
> > >> >> 
> > >> >> On the exit path exit_robust_list is triggering a page fault to fault a
> > >> >> page back in.  Which since we have no memory causes the exit path
> > >> >> to get stuck in mem_cgroup_handle_oom.
> > >> >
> > >> > Hmm, interesting. I assume the exit is caused by the SIGKILL, right?
> > >> > If yes, then why hasn't it been caught early in __mem_cgroup_try_charge?
> > >> 
> > >> Interesting question.  This isn't the primary thread but we do send
> > >> SIGKILL to the secondary threads as well.
> > >> 
> > >> We definitely need those checks on both paths making my change valid.
> > >> 
> > >> Oh. Duh!  This is after we act on SIGKILL so SIGKILL is no longer
> > >> pending.
> > >
> > > Very well spotted Eric! What do you think about the following patch?
> > > I would have to check since when the exit path could trigger the fault
> > > but I guess this is worth stable backport.
> > 
> > It doesn't have a prayer of working.
> 
> So it hasn't passed your test?
> 
> > You leave open the race of a fatal signal being received before we go to
> > sleep.
> 
> If a fatal signal is received before we're going to sleep then
> schedule() should keep it on the runqueue, no?
> 
> static void __sched __schedule(void)
> {
> [...]
>         if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>                 if (unlikely(signal_pending_state(prev->state, prev))) {
>                         prev->state = TASK_RUNNING;
>                 } else {
> 
> so it should get a timeslice eventually; mem_cgroup_handle_oom sees
> fatal_signal_pending and sets TIF_MEMDIE, the task bypasses the charge,
> gets to signal handling, starts exiting, faults in, gets to the charge
> and bails out in __mem_cgroup_try_charge because it sees TIF_MEMDIE.
> 
> Or what am I missing?
> 
> > You don't handle a task that has processed the fatal signal and is in
> > PF_EXITING.  Which is what I experienced.
> > 
> > From earlier comments about my code not being early enough I thought I
> > was going to see a patch in __mem_cgroup_try_charge so that the bypass
> > case will kick in also for tasks in PF_EXITING.
> 
> This shouldn't be necessary because TIF_MEMDIE was set for the killed
> task.  I was playing with PF_EXITING there as well but TIF_MEMDIE sounds
> like a more appropriate solution.
> 
> > Your change actually addresses things later in the code path than mine
> > does.
> > 
> > I do like your summary of the problem.
> > 
> > Eric
> > 
> > > ---
> > > From 411408558f2858328ea25e69567e9a53a8314032 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko at suse.cz>
> > > Date: Wed, 31 Jul 2013 08:48:54 +0200
> > > Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM
> > >
> > > Eric has reported that he can see task(s) stuck in memcg OOM handler
> > > regularly. The only way out is to
> > > 	echo 0 > $GROUP/memory.oom_control
> > >
> > > His usecase is:
> > > - Setup a hierarchy with memory and the freezer
> > >   (disable kernel oom and have a process watch for oom).
> > > - In that memory cgroup add a process with one thread per cpu.
> > > - In one thread slowly allocate memory (once per second, I think it is
> > >   16M of RAM), then mlock and dirty it (just to force the pages into
> > >   RAM and keep them there).
> > > - When oom is achieved loop:
> > >   * attempt to freeze all of the tasks.
> > >   * if frozen send every task SIGKILL, unfreeze, remove the directory in
> > >     cgroupfs.
> > >
> > > Eric has then pinpointed the issue to be memcg specific.
> > >
> > > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> > > Those that have received fatal signal will bypass the charge and should
> > > continue on their way out. The tricky part is that the exit path might
> > > trigger a page fault (e.g. exit_robust_list) and thus a memcg charge
> > > while the memcg is still under OOM because nobody has released any
> > > charges. Unlike with the in-kernel OOM handler the exiting task doesn't
> > > get TIF_MEMDIE set so it doesn't shortcut charges and falls to the
> > > memcg OOM again without any way out of it as there are no fatal signals
> > > pending anymore.
> > >
> > > This patch sets the TIF_MEMDIE flag proactively in mem_cgroup_handle_oom
> > > if the memcg OOM killer is disabled and the task is woken up with a fatal
> > > signal pending. This means that any further charges will be bypassed early
> > > in __mem_cgroup_try_charge and the task will finally have a chance to exit.
> > >
> > > Strictly speaking we might mark also a task which hasn't been killed by
> > > userspace OOM handler but this is not harmful as the task is going away
> > > anyway and under-oom group would like to see it go as soon as possible.
> > >
> > > Reported-by: Eric W. Biederman <ebiederm at xmission.com>
> > > Debugged-by: Eric W. Biederman <ebiederm at xmission.com>
> > > Signed-off-by: Michal Hocko <mhocko at suse.cz>
> > > ---
> > >  mm/memcontrol.c | 13 ++++++++++++-
> > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index d12ca6f..d4103b0 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2235,8 +2235,19 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > >  
> > >  	mem_cgroup_unmark_under_oom(memcg);
> > >  
> > > -	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> > > +	if (test_thread_flag(TIF_MEMDIE))
> > >  		return false;
> > > +
> > > +	/*
> > > +	 * Userspace OOM killer might have killed this task but
> > > +	 * there is no way it could have set TIF_MEMDIE as well
> > > +	 * so we have to set it manually.
> > > +	 */
> > > +	if (fatal_signal_pending(current)) {
> > > +		if (memcg->oom_kill_disable)
> > > +			set_thread_flag(TIF_MEMDIE);
> > > +		return false;
> > > +	}
> > >  	/* Give chance to dying process */
> > >  	schedule_timeout_uninterruptible(1);
> > >  	return true;
> > > -- 
> > > 1.8.3.2
> 
> -- 
> Michal Hocko
> SUSE Labs

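For anyone picking this thread up again, the sequence described in the
quoted changelog can be modelled roughly in userspace. This is only a
sketch under simplifying assumptions, not kernel code: every model_*
name is hypothetical, wakeups and the waitqueue are reduced to plain
function calls, and a group with the in-kernel OOM killer enabled is
simply assumed to make progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_task {
	bool fatal_signal_pending;	/* SIGKILL queued, not yet acted on */
	bool tif_memdie;		/* stands in for TIF_MEMDIE */
};

struct model_memcg {
	bool oom_kill_disable;		/* in-kernel OOM killer disabled */
	bool under_oom;			/* limit hit, nobody released charges */
};

enum model_charge { MODEL_CHARGE_OK, MODEL_CHARGE_BYPASS, MODEL_CHARGE_STUCK };

/* Task is woken in the OOM handler; returns true if the charge is bypassed. */
static bool model_oom_wakeup(struct model_task *t, const struct model_memcg *m,
			     bool patched)
{
	assert(t != NULL && m != NULL);
	if (t->tif_memdie)
		return true;
	if (t->fatal_signal_pending) {
		/* The patched variant remembers the kill in the MEMDIE flag. */
		if (patched && m->oom_kill_disable)
			t->tif_memdie = true;
		return true;
	}
	return false;	/* would retry the charge and sleep again */
}

/* Charge attempt with the early bypass check described in the thread. */
static enum model_charge model_try_charge(const struct model_task *t,
					  const struct model_memcg *m)
{
	if (t->tif_memdie || t->fatal_signal_pending)
		return MODEL_CHARGE_BYPASS;
	if (!m->under_oom)
		return MODEL_CHARGE_OK;
	/* Assumption: with the OOM killer enabled, memory gets freed. */
	return m->oom_kill_disable ? MODEL_CHARGE_STUCK : MODEL_CHARGE_OK;
}

/* Replays: wakeup by SIGKILL, signal handled, page fault on the exit path. */
static enum model_charge model_scenario(bool patched)
{
	struct model_memcg m = { .oom_kill_disable = true, .under_oom = true };
	struct model_task t = { .fatal_signal_pending = true, .tif_memdie = false };

	/* 1. SIGKILL wakes the task sleeping in the OOM handler. */
	if (!model_oom_wakeup(&t, &m, patched))
		return MODEL_CHARGE_STUCK;
	/* 2. The task acts on SIGKILL, so it is no longer pending. */
	t.fatal_signal_pending = false;
	/* 3. The exit path faults a page back in and charges again. */
	return model_try_charge(&t, &m);
}
```

Under these assumptions model_scenario(false) ends in MODEL_CHARGE_STUCK
and model_scenario(true) ends in MODEL_CHARGE_BYPASS, which is the
difference the patch is supposed to make. It does not model a task that
is already in PF_EXITING before it ever reaches the OOM handler.
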
-- 
Michal Hocko
SUSE Labs