[PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

Greg Thelen gthelen at google.com
Tue Oct 5 16:26:52 PDT 2010

Minchan Kim <minchan.kim at gmail.com> writes:

> On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
>> If pages are being migrated from a memcg, then updates to that
>> memcg's page statistics are protected by grabbing a bit spin lock
>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>> accounting will be updating memcg page accounting (specifically:
>> num writeback pages) from softirq.  Avoid a deadlocking nested
>> spin lock attempt by disabling interrupts on the local processor
>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>> This avoids the following deadlock:
>> statistic
>>       CPU 0             CPU 1
>>                     inc_file_mapped
>>                     rcu_read_lock
>>   start move
>>   synchronize_rcu
>>                     lock_page_cgroup
>>                       softirq
>>                       test_clear_page_writeback
>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>                       rcu_read_lock
>>                       lock_page_cgroup   /* deadlock */
>>                       unlock_page_cgroup
>>                       rcu_read_unlock
>>                     unlock_page_cgroup
>>                     rcu_read_unlock
>> By disabling interrupts in lock_page_cgroup, nested calls
>> are avoided.  The softirq would be delayed until after inc_file_mapped
>> enables interrupts when calling unlock_page_cgroup().
>> The normal, fast path, of memcg page stat updates typically
>> does not need to call lock_page_cgroup(), so this change does
>> not affect the performance of the common case page accounting.
>> Signed-off-by: Andrea Righi <arighi at develer.com>
>> Signed-off-by: Greg Thelen <gthelen at google.com>
>> ---
>>  include/linux/page_cgroup.h |    8 +++++-
>>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>>  2 files changed, 36 insertions(+), 23 deletions(-)
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index b59c298..872f6b1 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>>  	return page_zonenum(pc->page);
>>  }
>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>> +				    unsigned long *flags)
>>  {
>> +	local_irq_save(*flags);
>>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>>  }
> Hmm. Let me ask questions. 
> 1. Why do you add new irq disable region in general function?
> I think __do_fault is a one of fast path.

This is true.  I used pft to measure the cost of this extra locking
code.  This pft workload exercises this memcg call stack:

I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the

First I established a performance baseline using upstream mmotm locking
(not disabling interrupts).
	100 samples: mean 51930.16383  stddev 2.032% (1055.40818272)

Then I introduced this patch, which disabled interrupts in
	100 samples: mean 52174.17434  stddev 1.306% (681.14442646)

Then I replaced this patch's usage of local_irq_save/restore() with
	100 samples: mean 51810.58591  stddev 1.892% (980.340335322)

The proposed patch (#2) actually improves allocation performance by
0.47% when compared to the baseline (#1).  However, I believe that this
is in the statistical noise.  This particular workload does not seem to
be affected this patch.

> Could you disable softirq using _local_bh_disable_ not in general function 
> but in your context?

lock_page_cgroup() is only used by mem_cgroup in memcontrol.c.

local_bh_disable() should also work instead of my proposed patch, which
used local_irq_save/restore().  local_bh_disable() will not disable all
interrupts so it should have less impact.  But I think that usage of
local_bh_disable() it still something that has to happen in the general
lock_page_cgroup() function.  The softirq can occur at an arbitrary time
and processor with the possibility of interrupting anyone who does not
have interrupts or softirq disabled.  Therefore the softirq could
interrupt code that has used lock_page_cgroup(), unless
lock_page_cgroup() explicitly (as proposed) disables interrupts (or
softirq).  If (as you suggest) some calls to lock_page_cgroup() did not
disable softirq, then a deadlock is possible because the softirq may
interrupt the holder of the page cgroup spinlock and the softirq routine
that also wants the spinlock would spin forever.

Is there a preference between local_bh_disable() and local_irq_save()?
Currently the page uses local_irq_save().  However I think it could work
by local_bh_disable(), which might have less system impact.

> how do you expect that how many users need irq lock to update page state?
> If they don't need to disalbe irq?

Are you asking how many cases need to disable irq to update page state?
Because there exists some code (writeback memcg counter update) that
lock the spinlock in softirq, then it must not be allowed to interrupt
any holders of the spinlock.  Therefore any code that locked the
page_cgroup spinlock must disable interrupts (or softirq) to prevent
being preempted by a softirq that will attempt to lock the same

> We can pass some argument which present to need irq lock or not.
> But it seems to make code very ugly. 

This would be ugly and I do not think it would avoid the deadlock
because the softirq for the writeback may occur for a particular page at
any time.  Anyone who might be interrupted by this softirq must either:
a) not hold the page_cgroup spinlock
b) disable interrupts (or softirq) to avoid being preempted by code that
   may want the spinlock.

> 2. So could you solve the problem in your design?
> I mean you could update page state out of softirq?
> (I didn't look at the your patches all. Sorry if I am missing something)

The writeback statistics are normally updated for non-memcg in
test_clear_page_writeback().  Here is an example call stack (innermost

Given that test_clear_page_writeback() is where non-memcg stats are
updated for non-memcg, it seems like the most natural place to update
memcg writeback stats.  Theoretically we could introduce some sort of
work queue of pages that need writeback stat updates.
test_clear_page_writeback() would enqueue to-do work items to this list.
A worker thread (not running in softirq) would process this list and
apply the changes to the mem_cgroup.  This seems very complex and will
likely introduce a longer code path that will introduce even more

> 3. Normally, we have updated page state without disable irq. 
> Why does memcg need it?

Non-memcg writeback stats updates do disable interrupts by using
spin_lock_irqsave().  See upstream test_clear_page_writeback() for
an example.

Memcg must determine the cgroup associated with the page to adjust that
cgroup's page counter.  Example: when a page writeback completes, the
associated mem_cgroup writeback page counter is decremented.  In memcg
this is complicated by the ability to migrate pages between cgroups.
When a page migration is in progress then locking is needed to ensure
that page's associated cgroup does not change until after the statistic
update is complete.  This migration race is already efficiently solved
in mmotm efficiently with mem_cgroup_stealed(), which safely avoids many
unneeded locking calls.  This proposed patch integrates with the
mem_cgroup_stealed() solution.

> I hope we don't add irq disable region as far as possbile. 

I also do not like this, but do not see a better way.  We could use
local_bh_disable(), but I think it needs to be uniformly applied by
adding it to lock_page_cgroup().


More information about the Containers mailing list