[PATCH 5/5] c/r: Add AF_UNIX support (v6)

Dan Smith danms at us.ibm.com
Tue Jul 28 13:34:24 PDT 2009

OL> obj_sock_users() is only required for objects that are to be
OL> tested for leaks in full container checkpoint. I don't think it's
OL> needed, because the relation sock <-> file is 1-to-1.  (If it
OL> were, then you would also need to collect sockets).

Okay, that also answers the question posed by John in the other

OL> (Actually, I will remove checkpoint_bad and restore_bad and modify
OL> checkpoint_obj() and restore_obj() to fail if the respective
OL> method is NULL).

Okay, should I expect that to show up in v17-dev soon?

OL> Nit: I already got confused a few times because of the similar
OL> names.  Perhaps change CKPT_HDR_SOCKET_BUFFERS to
OL> CKPT_HDR_SOCKET_QUEUE (and adjust the header name accordingly).

Agreed.  I remember writing that and remarking to myself how
ridiculous of a name it was, but never went back to change it.

OL> Unless/until we decide otherwise, let's keep all ckpt_hdr_xxx
OL> definitions in a single place -- checkpoint_hdr.h

That's how I initially had it, but Serge asked for it to be moved into
the network headers.  Serge?

OL> Unless you intend to use the struct name 'ckpt_socket' elsewhere,
OL> can we get rid of it (leaving only 'struct {') ?


OL> Can you use fill_name() from checkpoint/file.c ?

Yeah, looks like it.

OL> This direct call to ->getname skips security checks through
OL> getsockname(). You may want to refactor sys_getsockname() similar
OL> to sys_bind().


>> +	if ((h->sock.userlocks & SOCK_SNDBUF_LOCK) &&
>> +	    ((h->sock.sndbuf < SOCK_MIN_SNDBUF) ||
>> +	     (h->sock.sndbuf > sysctl_wmem_max)))
>> +		return -EINVAL;

OL> At least for afunix, if the user did not explicitly set the
OL> sndbuf, then userlocks won't have SOCK_SNDBUF_LOCK set.

OL> Also, if userlocks in the image doesn't have SOCK_SNDBUF_LOCK,
OL> then the sndbuf value isn't checked ?

I think I was thinking that I only needed to verify the buffer value
if the user claimed to have set it (as if it would be ignored
otherwise), but that doesn't seem right.  So, I think the proper
thing to do here is always check it (i.e., remove the first check of
the lock).

OL> What about verifying sock.flags itself ?
OL> In doing that, some options may assume/require some state --
OL> - SOCK_USE_WRITE_QUEUE only used in ipv4/ipv6/sctp
OL> - SOCK_DESTROY only used in some protocols

OL> Perhaps sanitize sock.flags per protocol ?

Hmm, okay.

OL> Many of these direct copy into the socket and sock effectively
OL> bypass security checks that take place in {get,set}sockopt(),
OL> either explicitly (e.g. sk_sndtimeo) or implicitly (e.g.
OL> SOCK_LINGER in sock.flags, reflecting SO_LINGER option).

OL> This applies both to checkpoint (potentially bypassing permission
OL> of the checkpointer to view this data) and restart (bypassing
OL> permissions of user to set these values).

OL> The alternative is to use socksetopt/sockgetopt for those values
OL> that should be subject to security checks.

Yeah, I suppose so.  I've resisted that thus far because it will make
the sync operation so much harder to read, but I suppose it's

OL> There should also be per-protocol consistency checks. E.g. afunix
OL> cannot be in socket.state == SS_{DIS,}CONNECTING

I suppose so, but I don't see anything in af_unix.c that seems to care :)

OL> Better yet: -ENOSYS ?


>> +	ret = kernel_sendmsg(sock->sk_socket, &msg, &kvec, 1, len);

OL> Is it possible to avoid the extra copy using splice() instead ?

It's possible, but it will require some refactoring to get access to
all the right pointers.  I'd propose we push that optimization out
until after we've got these patches integrated into the c/r tree.

OL> SOCK_DEAD in sk->flags may also pose a problem.  (Do we at all
OL> need to checkpoint and/or restore SOCK_DEAD ?!)

Is there any reasonable way we could arrive at a SOCK_DEAD socket via
a valid descriptor?

OL> Hmm... this test is quite hidden here - maybe a fat comment with a
OL> reference to why it's here and what it is doing ?


>> +		else {
>> +			ckpt_debug("Buffer total %u exceeds limit %u\n",
>> +			   h->total_bytes, *bufsize);
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	for (i = 0; i < h->skb_count; i++) {
>> +		ret = sock_read_buffer(ctx, sock);
>> +		ckpt_debug("read buffer %i: %i\n", i, ret);
>> +		if (ret < 0)
>> +			break;
>> +
>> +		total += ret;
>> +		if (total > h->total_bytes) {

OL> What if 'total' overflows (for CAP_NET_ADMIN) ?   perhaps:
OL> 		if (total > h->total_bytes || total < ret) {

Actually, I've changed this locally to require fewer variables and
special cases.  Let me know when I re-post if you don't think it's a
better way to handle this.

OL> Does the following bypass security checks for sys_connect() ?

I don't think so.  We're basically replicating sys_socketpair() here,
which does not do a security check, presumably because all you're
doing is hooking two sockets together that both belong to you.  That's
not to say that we're as safe as that limited operation, but I don't
think it's totally clear.  Perhaps someone more confident will

>> +	/* Prime the socket's buffer limit with the maximum */
>> +	peer->sk_userlocks |= SOCK_SNDBUF_LOCK;
>> +	peer->sk_sndbuf = sysctl_wmem_max;

OL> I suppose this meant to be temporary ?

Right, it will be overridden when we synchronize the socket
properties.  I'll strengthen the comment.

OL> It is indeed a good idea to make it generic for all sockets, but I
OL> suspect the way sock_read_buffers() works will only be suitable
OL> for afunix, and perhaps for in-container inet4/6 (by
OL> "in-container" I mean that both sides of the connection are in the
OL> checkpoint image).

Yeah, I meant to put a comment in the commit log about this.  As I
mentioned before, I was hoping to put most of the effort into the UNIX
patch and move forward with that as we iterate on the INET patch.
Thus, this was part of the buffer restore re-swizzle that clearly
won't work for INET.  In the meantime I've been working on the INET
patch and splitting it up while retaining the necessary behavioral
changes made here.  In my next posting these should look better.

OL> Hmm... does the code below work well with dgram sockets who
OL> received data from the peer, and then the peer died (before
OL> checkpoint) ?

I'm not sure that I've replicated that exact scenario in my tests.
However, when we're restoring socket A with outstanding incoming
buffers, we restore them via B.  If B is not in a reasonable state, I
put it back into connected state to appease the sendmsg function and
restore it when complete.  See the "unshutdown" comment.

>> +static struct file *sock_alloc_attach_fd(struct socket *socket)

OL> Nit: I think this name is misleading - sock_attach_fd() itself is
OL> already and should have been sock_attach_file() IMHO - so perhaps
OL> s/fd/file here ?

The name was chosen to reflect the fact that we're allocating a socket
and then calling sock_attach_fd() on it.  If you think it's less
misleading to be inconsistent with the other call, I'll change it, but
I wouldn't really agree... :)


Dan Smith
IBM Linux Technology Center
email: danms at us.ibm.com

More information about the Containers mailing list