Protection against container fork bombs [WAS: Re: memcg with kmem limit doesn't recover after disk i/o causes limit to be hit]

Michal Hocko mhocko at suse.cz
Wed Apr 30 13:31:22 UTC 2014


On Wed 30-04-14 00:36:40, Marian Marinov wrote:
> On 04/29/2014 09:27 PM, Michal Hocko wrote:
> >On Tue 29-04-14 19:09:27, Richard Davies wrote:
> >>Dwight Engen wrote:
> >>>Michal Hocko wrote:
> >>>>Tim Hockin wrote:
> >>>>>Here's the reason it doesn't work for us: It doesn't work.
> >>>>
> >>>>There is a "simple" solution for that. Help us to fix it.
> >>>>
> >>>>>It has been something like 2 YEARS since we first wanted this, and
> >>>>>it STILL does not work.
> >>>>
> >>>>My recollection is that it was primarily Parallels and Google asking
> >>>>for the kmem accounting. The reason I didn't fight against its
> >>>>inclusion, even though the implementation at the time lacked proper
> >>>>slab shrinking, was that this was supposed to come later. Well, that
> >>>>later hasn't happened yet, although we are slowly getting there.
> >>>>
> >>>>>You're postponing a pretty simple request indefinitely in
> >>>>>favor of a much more complex feature, which still doesn't really
> >>>>>give me what I want.
> >>>>
> >>>>But we cannot simply add a new interface that will have to be
> >>>>maintained forever just to work around bugs in something else that is
> >>>>supposed to cover this use case.
> >>>>
> >>>>>What I want is an API that works like rlimit but per-cgroup, rather
> >>>>>than per-UID.
> >>>>
> >>>>You can use an out-of-tree patchset for the time being or help to get
> >>>>kmem into shape. If there are fundamental reasons why kmem cannot be
> >>>>used, then you had better articulate them.
> >>>
> >>>Is there a plan to separately account/limit stack pages vs kmem in
> >>>general? Richard would have to verify, but I suspect kmem is not
> >>>currently viable as a process limiter for him because icache/dcache/stack
> >>>are all accounted together.
> >>
> >>Certainly I would like to be able to limit container fork-bombs without
> >>limiting the amount of disk IO caching for processes in those containers.
> >>
> >>In my testing with kmem limits, I needed a limit of 256MB or lower to
> >>catch fork bombs early enough. I would definitely like more than 256MB of
> >>disk caching.
> >>
> >>So if we go the "working kmem" route, I would like to be able to specify a
> >>limit excluding disk cache.
> >
> >Page cache (which is probably what you mean by disk cache) is accounted
> >as userspace memory by the memory cgroup controller, and you do not have
> >to limit that one. Kmem accounting refers to kernel-internal allocations
> >- slab memory and per-process kernel stacks. You can see how much memory
> >is allocated per container via memory.kmem.usage_in_bytes, or have a
> >look at /proc/slabinfo to see what kind of memory the kernel allocates
> >globally and might account to a container as well.
> >
> >The primary problem with kmem accounting right now is that such memory
> >is not "reclaimed", so once the kmem limit is reached all further kmem
> >allocations fail. The biggest user of kmem allocations on many systems
> >is the dentry and inode cache, which is easily reclaimable. Once such
> >reclaim is implemented, the kmem limit will be usable both to prevent
> >fork bombs and to prevent other DoS scenarios where the kernel is pushed
> >to allocate a huge amount of memory.
> 
> I would have to disagree here.
> If a container starts to create many processes it will use kmem; however,
> in my use cases the memory is not the problem. Simply scheduling so many
> processes generates heavy load on the machine. Even if I have the memory
> to handle this... the problem becomes the scheduling of all of these
> processes.

What prevents you from setting the kmem limit to NR_PROC * 8K + slab_pillow?
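
To make that arithmetic concrete, here is a minimal userspace sketch of
the suggestion. The process budget, the slab headroom and the cgroup path
are illustrative assumptions, not values from this thread;
memory.kmem.limit_in_bytes is the cgroup v1 memcg knob the computed limit
would be written to, and memory.kmem.usage_in_bytes in the same directory
shows the current kernel-memory charge for comparison:

  /*
   * Sketch: kmem limit ~= NR_PROC kernel stacks of 8K each plus some
   * slab headroom ("slab_pillow").  The task budget, headroom and
   * cgroup path below are assumptions for illustration only.
   */
  #include <stdio.h>

  int main(void)
  {
          const long nr_proc = 1024;                  /* assumed per-container task budget */
          const long kernel_stack = 8 * 1024;         /* 8K kernel stack per task */
          const long slab_pillow = 64L * 1024 * 1024; /* assumed headroom for dentry/inode slab */
          const long limit = nr_proc * kernel_stack + slab_pillow;

          /* hypothetical container group under the cgroup v1 memory controller */
          const char *cg = "/sys/fs/cgroup/memory/container0";
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "%s/memory.kmem.limit_in_bytes", cg);
          f = fopen(path, "w");
          if (!f) {
                  perror("fopen");
                  return 1;
          }
          fprintf(f, "%ld\n", limit);
          fclose(f);
          return 0;
  }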

> A typical rsync of 2-3TB of small files (1-100k) will generate heavy
> pressure on the kmem, but will not produce many processes.

Once we have a proper slab reclaim implementation this shouldn't be a
problem.

> On the other hand, forking thousands of processes with a low memory
> footprint will hit the scheduler a lot faster than it hits the kmem limit.
>
> The kmem limit is something that we need! But I firmly believe that we
> also need a simple NPROC limit for cgroups.

Once again: if you feel that your use case is not covered by the kmem
limit, follow up on the original email thread I referenced earlier.
Splitting up the discussion doesn't help at all.

> -hackman
> 
> >
> >HTH
> >
> >>I am also somewhat worried that normal software use could legitimately go
> >>above 256MB of kmem (even excluding disk cache) - I got to 50MB in testing
> >>just by booting a distro with a few daemons in a container.
> >>
> >>Richard.
> >
> 

-- 
Michal Hocko
SUSE Labs

