[PATCH 11/15] Signal semantics

Fri Jul 27 11:46:04 PDT 2007

Pavel Emelianov [xemul at openvz.org] wrote:
| Oleg Nesterov wrote:
| >Damn. I don't have time to read these patches today (will try tomorrow),
| 
| Oh, that's OK. I was about to send the set to Andrew only the next week.
| 
| This patch is the most strange one and is to be discussed a lot.
| 
| We try to do the following two things:
| 1. signals going from the namespace, that the target task doesn't
|   see must be seen as SI_KERNEL if siginfo is allocated;
| 2. signals to init of any namespace must be allowed to send from
|   one of the parent namespaces only. From child namespace, init
|   needs only those, that it's ready to handle (SIGCHLD).

Yes.

| 
| As far as I understand Suka's approach (it's his patch, so I may 
| be not 100% correct - it's better to wait for his comments) he is 
| trying to carry the information about the signal up to the
| get_signal_to_deliver().
| 
| As far as the first issue is concerned, the solution is obvious -
| all the "calculations" can be done at the beginning of sending the
| signal, but the second issue is a bit more complicated and I have
| no good ideas of how to solve this :( yet.

Even I am looking for a better approach.

| 
| Thanks,
| Pavel
| 
| >but when I glanced at this patch yesterday I had some suspicions...
| >
| >On 07/26, Pavel Emelyanov wrote:
| >>+++ linux-2.6.23-rc1-mm1-7/kernel/signal.c	2007-07-26 
| >>16:36:37.000000000 +0400
| >>@@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
| >>	if (first) {
| >>		list_del_init(&first->list);
| >>		copy_siginfo(info, &first->info);
| >>+		if (first->flags & SIGQUEUE_CINIT)
| >>+			kinfo->flags |= KERN_SIGINFO_CINIT;
| >>+
| >>
| >>[...snip...]
| >>
| >>@@ -1852,7 +1950,7 @@ relock:
| >>		 * within that pid space. It can of course get signals from
| >>		 * its parent pid space.
| >>		 */
| >>-		if (current == task_child_reaper(current))
| >>+		if (kinfo.flags & KERN_SIGINFO_CINIT)
| >>			continue;
| >
| >I think the whole idea is broken, it assumes the sender put something into
| >"struct sigqueue".
| 
| Yup. That's the problem. It seems to me that the only way to handle init's
| signals is to check for permissions in the sending path.

We can check permissions in the sending path - and in fact we do check for
SIGKILL case (deny_signal_to_container_init() below).

But the receiver knows/decides whether or not the signal is wanted/not. No ?

Are you saying we should check/special case all fatal signals ?

| 
| >Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
| >signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
| >is lost because __group_complete_signal() silently "converts" sig_fatal()
| >signals to SIGKILL using sigaddset().

Yes, I should have called it out, but this patch currently assumes /sbin/init
(or container-init) has a handler for the fatal signals like SIGTERM and has
a check for SIGKILL (in deny_signal_to_container_init() - as Oleg noted below).

Still looking for better ways to implement.

| >
| >>+static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
| >>+{
| >>+	/*
| >>+	 * If sender (i.e 'current') and receiver have the same active
| >>+	 * pid namespace and the receiver is the container-init, set the
| >>+	 * SIGQUEUE_CINIT flag. This tells the container-init that the
| >>+	 * signal originated in its own namespace and so it can choose
| >>+	 * to ignore the signal.
| >>+	 *
| >>+	 * If the receiver is the container-init of a pid namespace,
| >>+	 * but the sender is from an ancestor pid namespace, the
| >>+	 * container-init cannot ignore the signal. So clear the
| >>+	 * SIGQUEUE_CINIT flag in this case.
| >>+	 *
| >>+	 * Also, if the sender does not have a pid_t in the receiver's
| >>+	 * active pid namespace, set si_pid to 0 and pretend it originated
| >>+	 * from the kernel.
| >>+	 */
| >>+	if (pid_ns_equal(t)) {
| >>+		if (is_container_init(t)) {
| >>+			q->flags |= SIGQUEUE_CINIT;
| >
| >Ironically, this change carefully preserves the bug we already have :)
| >
| >This doesn't protect init from "bad" signal if we send it to sub-thread
| >of init. Actually, this make the behaviour a bit worse compared to what
| >we currently have. Currently, at least the main init's thread survives
| >if we send SIGKILL to sub-thread.

Do you mean "init's main thread" ? But doesn't SIGKILL to any thread kill
the entire process ?

| >
| >>static int send_signal(int sig, struct siginfo *info, struct task_struct 
| >>*t,
| >>			struct sigpending *signals)
| >>{
| >>@@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
| >>			copy_siginfo(&q->info, info);
| >>			break;
| >>		}
| >>+		encode_sender_info(t, q);
| >
| >We still send the signal if __sigqueue_alloc() fails. In that case, the
| >dequeued siginfo won't have SIGQUEUE_CINIT/KERN_SIGINFO_CINIT, not good.

Yes.

| >
| >>@@ -1158,6 +1232,13 @@ static int kill_something_info(int sig, 
| >>
| >>		read_lock(&tasklist_lock);
| >>		for_each_process(p) {
| >>+			/*
| >>+			 * System-wide signals apply only to the sender's
| >>+			 * pid namespace, unless issued from init_pid_ns.
| >>+			 */
| >>+			if (!task_visible_in_pid_ns(p, my_ns))
| >>+				continue;
| >>+
| >>			if (p->pid > 1 && p->tgid != current->tgid) {
| >
| >This "p->pid > 1" check should die.
| >

Ok.

| >>+static int deny_signal_to_container_init(struct task_struct *tsk, int 
| >>sig)
| >>+{
| >>+	/*
| >>+	 * If receiver is the container-init of sender and signal is SIGKILL
| >>+	 * reject it right-away. If signal is any other one, let the 
| >>container
| >>+	 * init decide (in get_signal_to_deliver()) whether to handle it or
| >>+	 * ignore it.
| >>+	 */
| >>+	if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
| >>+		return -EPERM;
| >>+
| >>+	return 0;
| >>+}
| >>+
| >>/*
| >> * Bad permissions for sending the signal
| >> */
| >>@@ -545,6 +584,10 @@ static int check_kill_permission(int sig
| >>	    && !capable(CAP_KILL))
| >>		return error;
| >>
| >>+	error = deny_signal_to_container_init(t, sig);
| >>+	if (error)
| >>+		return error;
| >
| >Hm. Could you explain this change? Why do we need a special check for 
| >SIGKILL?

As you pointed out above, SIGKILL goes through the __group_complete_signal()/
sigaddset() path and bypasses/loses the KERN_SIGINFO_CINIT flag. Other
sig_fatal() signals take this path too, but we assume for now, container-init
has a handler.

| >
| >
| >(What about ptrace_attach() btw? If it is possible to send a signal to init
| > from the "parent" namespace, perhaps it makes sense to allow ptracing as
| > well).
| 
| ptracing of tasks fro different namespaces is not possible at all, since
| strace utility determines the fork()-ed child pid from the parent's eax
| register, which would contain the pid value as this parent sees his child.
| But if the strace is in different namespace - it won't be able to find
| this child with the pid value from parent's eax.
| 
| Maybe it's worth disabling cross-namespaces ptracing...

I think so too. Its probably not a serious limitation ?

| 
| >
| >Oleg.
| >
| >