Need help to debug container's freeze

Sat Oct 30 17:36:52 PDT 2010

On Fri, Oct 29, 2010 at 04:27:40PM +0400, Пётр Волков wrote:
> Hi. We are using lxc to separate different services into containers: for
> this discussion, we have apache+php, mysql, and nginx containers to serve
> our web application. After an upgrade (I think from kernel 2.6.32 to
> something newer; we are now using 2.6.35, but tried 2.6.34 too) we've
> experienced the following issue: at some point nginx starts to show us a
> "504 Gateway Time-out" error, and while it is possible to ssh to the
> server, `ps aux` hangs (with no ability to stop it), it is impossible to
> restart the apache container (it hangs on stop), and the only way to fix
> this is to restart the server using sysrq or the power button. At the same
> time there is nothing in the logs. I suspect apache starts to eat lots of
> memory and the oom killer somehow freezes the container, but I don't have
> any proof. What

The OOM killer does not freeze tasks. Now if the tasks were already
frozen and if the OOM killer selected them then I can see how that
would be a problem. However, again I doubt that's what's happening here
for several reasons.

1. lxc doesn't arbitrarily freeze tasks -- unless you were checkpointing
	or freezing the task yourself (or using a custom script to do
	so), the tasks in the container's cgroup should not be frozen.

2. If the task(s) are frozen then by definition they are not allocating
	memory. At best they're pinning the memory they've already
	allocated before being frozen. [ The tasks will respond to
	kill signals when thawed. ]

> could you suggest to debug this issue? What sysrq information could be
> useful here?

[ Cc'ing lxc-users at lists.sf.net for lxc-specific debugging ideas/advice. ]

Here's some info on collecting and diagnosing the state of the freezer
so that hopefully we can eliminate your concerns about it being involved
and confirm what I've said above:

To figure out whether the cgroup freezer is involved at all, you need
to debug from the "host". Find out which process ids are your
apache/nginx/etc processes. Then look at their cgroups in
/proc/<pid>/cgroup. Keep in mind that the "/" in those paths isn't the
filesystem root -- it's the directory the cgroup subsystems are mounted
at (see /proc/mounts to figure out where). You want the line that says
"freezer".
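
If it helps, here is a rough Python sketch of that lookup. It assumes the
usual "id:subsystems:path" line format in /proc/<pid>/cgroup and a cgroup
filesystem mounted with "freezer" among its options; the function names are
made up for illustration and are not part of lxc.

```python
import os

def freezer_cgroup_path(cgroup_text):
    """Return the cgroup path from the 'freezer' line of /proc/<pid>/cgroup."""
    for line in cgroup_text.splitlines():
        # Assumed line format: hierarchy-id:subsys1,subsys2:/path
        parts = line.strip().split(":", 2)
        if len(parts) == 3 and "freezer" in parts[1].split(","):
            return parts[2]
    return None

def freezer_mount_point(mounts_text):
    """Return where a cgroup filesystem with the freezer subsystem is mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if (len(fields) >= 4 and fields[2] == "cgroup"
                and "freezer" in fields[3].split(",")):
            return fields[1]
    return None

def freezer_dir_for_pid(pid):
    """Combine both lookups into the on-disk cgroup directory (run on the host)."""
    with open("/proc/%d/cgroup" % pid) as f:
        rel = freezer_cgroup_path(f.read())
    with open("/proc/mounts") as f:
        mount = freezer_mount_point(f.read())
    if rel is None or mount is None:
        return None
    # The "/" in the cgroup path is relative to the mount point.
    return os.path.normpath(os.path.join(mount, rel.lstrip("/")))
```

Calling freezer_dir_for_pid() with one of your apache pids should give you
the directory that holds the tasks and freezer.state files.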

Go to the cgroup mount point that has the freezer subsystem (it'll say
"freezer" in the mount options) and look in the cgroup(s) of these
processes. Confirm that your pids are listed in the cgroup by looking
at the tasks file.
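
A small sketch of the tasks-file check, assuming the file holds one numeric
id per line. The helper names are hypothetical.

```python
def pids_in_tasks(tasks_text):
    """Parse the contents of a cgroup 'tasks' file into a set of ids."""
    return {int(tok) for tok in tasks_text.split()}

def missing_from_cgroup(expected_pids, tasks_text):
    """Return the expected pids that are NOT listed in the tasks file."""
    return sorted(set(expected_pids) - pids_in_tasks(tasks_text))
```

An empty list from missing_from_cgroup() means every pid you expected is
in that cgroup.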

If the freezer.state file of those cgroups contains the word "THAWED"
then the problem lies elsewhere. If the freezer.state says "FREEZING"
or "FROZEN" however then you'll want to look at the state of the
processes. Some or all should be in the "D" state while "FREEZING".
All should be in "D" state while "FROZEN".
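
Roughly, the check described above could be scripted like this. It is only
a sketch: it assumes the process state is the field right after the
parenthesised command name in /proc/<pid>/stat, and the function names and
messages are my own.

```python
def task_state(stat_text):
    """Extract the state letter (R, S, D, ...) from /proc/<pid>/stat text."""
    # The command name is in parentheses and may itself contain ")",
    # so split on the LAST closing parenthesis.
    return stat_text.rsplit(")", 1)[1].split()[0]

def not_in_d_state(states):
    """Given {pid: state letter}, return the pids that are not in 'D'."""
    return sorted(pid for pid, state in states.items() if state != "D")

def diagnose(freezer_state, states):
    """Summarise what freezer.state plus the task states suggest."""
    stragglers = not_in_d_state(states)
    if freezer_state == "THAWED":
        return "freezer not involved; the problem lies elsewhere"
    if freezer_state == "FROZEN":
        if stragglers:
            return "FROZEN but tasks not in D state: %s" % stragglers
        return "FROZEN and all tasks in D state"
    if freezer_state == "FREEZING":
        return "FREEZING (intermediate); tasks not yet in D state: %s" % stragglers
    return "unrecognised freezer state: %r" % freezer_state
```

Feed task_state() the contents of /proc/<pid>/stat for each id in the tasks
file and pass the resulting dictionary to diagnose().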

"FREEZING" is an intermediate state however so it's not possible to
determine if there's a bug based purely on the info collected so far.
The best you can do with "FREEZING" is try and write "FROZEN" into
freezer.state one or more times and see if it 'eventually' succeeds
-- say within 10 seconds or 20 attempts, whichever takes longer.
If it doesn't then you need to strace the processes and see if any
are stuck in a syscall -- vfork perhaps. You can also try writing
"THAWED". If it doesn't thaw on the first try then there's a bug.

Whenever you write a new state to freezer.state you should read the
file again to find out whether the state change took place. Some
transitions are handled lazily and only take place when you ask for
the state by reading it.
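
The write-then-read-back loop might look something like the sketch below.
The defaults (20 attempts, half a second apart, so about 10 seconds) are
just my reading of the rule of thumb above, and ignoring a failed write is
a defensive assumption, since I believe some kernels can reject the write
while tasks are still freezing.

```python
import os
import time

def read_state(cgroup_dir):
    """Read freezer.state; reading is also what triggers lazy transitions."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()

def request_state(cgroup_dir, target, attempts=20, delay=0.5):
    """Write target ("FROZEN" or "THAWED") and re-read until it sticks.

    Returns (succeeded, last_state_read).
    """
    path = os.path.join(cgroup_dir, "freezer.state")
    state = None
    for _ in range(attempts):
        try:
            with open(path, "w") as f:
                f.write(target)
        except OSError:
            pass  # assumption: a refused write is retried rather than fatal
        # Always read the file back to learn whether the change took place.
        state = read_state(cgroup_dir)
        if state == target:
            return True, state
        time.sleep(delay)
    return False, state
```

For the thaw test, request_state(dir, "THAWED", attempts=1) returning
False is the case I'd call a bug.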

That's the way to figure out if the freezer is involved and, if so,
where it's stuck.

Cheers,
	-Matt Helsley