[PATCH 2/2] [RFC] Add checkpoint/restart support for epoll files.

Matt Helsley matthltc at us.ibm.com
Wed Aug 26 02:00:43 PDT 2009


On Tue, Aug 25, 2009 at 01:42:20PM -0400, Oren Laadan wrote:
> 
> 
> Matt Helsley wrote:
> > On Thu, Aug 20, 2009 at 12:33:38PM -0400, Oren Laadan wrote:
> >> Hi Matt,
> >>
> >> I like the approach of the patch and use of deferqueue.
> >> See comments below.
> >>
> >> Matt Helsley wrote:
> >>> Save/restore epoll items during checkpoint/restart respectively.
> >>>
> >>> Tests for the cr_tests suite to follow. Tests pass on i386.
> >>>
> >>> TODOs (search the patch for "TODO") that could probably use some
> >>> comments:
> >>>
> 
> [...]
> 
> >>> +
> >>> +struct epoll_deferq_entry {
> >>> +	struct ckpt_ctx *ctx;
> >>> +	struct file *epfile;
> >>> +};
> >> I suggest that:
> >>
> >> 1) Add 'int fd' here,
> >> 2) in ckpt_eventpoll_items use the fd (instead of file objref)
> >>
> >> The motivation is security: if, on restart, you rely on a file objref,
> >> then malicious user can change the file objref and the restart logic
> >> may end up populating some other process's event poll items.
> >>
> >> Instead, by using an fd, we ensure that the restarting process will
> >> only modify a file that it actually owns.
> > 
> > The fd saved with the items is not necessarily matched with the file*
> > anymore. For example userspace could do:
> > 
> > fda = open("/dev/null", O_RDONLY);
> > epoll_ctl(efd, EPOLL_CTL_ADD, fda, ...);
> > fdb = dup(fda);
> > dup2(0, fda);
> > <checkpoint here>
> > epoll_ctl(efd, EPOLL_CTL_MOD, fda, ...);
> > 
> > Saving the fd would associate stdin with the epoll item and not
> > /dev/null.
> > 
> > So we can't use the fd in place of the file reference.
> 
> You are totally right. I was confused by the fact that epoll data
> does save the original fd, and assumed it was "updated" at close()
> and dup() time. But not...
> 
> ... which makes me wonder: is it impossible for a user to delete
> the original fd from the epoll using epoll_ctl(.., EPOLL_CTL_DEL...) ?
> 
> This is related to my concern during restart: the input can provide
> _any_ file objref that is valid at that time during restart, and
> plug in files owned by different processes than intended.
> 
> I'm unsure to what extent this is a security concern, and what's
> the right approach to solve it.

If it is a security concern (and I don't think it's any more a concern
than the other objrefs) then my hunch is that namespaces ought to be
able to help somehow, but I haven't really thought that through.

> Looking at ep_cmp_ffd(), it uses both the file pointer and the fd
> to find a matching entry in the RB tree; So I'd think that in that
> case you need to save both the original FD and the objref of the
> file, no ?

Yup. I believe that's what's in the patch I posted.  Each item has the
fd, events, and data that it was registered with plus a file objref.

> 
> >>> +static int ep_items_restore(void *data)
> >>> +{
> >>> +	struct ckpt_ctx *ctx = *((struct ckpt_ctx**)data);
> >> This is really obfuscated :(
> > 
> > Tell that to the author of deferqueue. ;) Since deferqueue does memcpy()
> > there's really not a better way to pass the ctx pointer.
> 
> Next time I'll use my right to remain silent :p
> 
> Seriously, maybe we can add a wrapper for this (common ?) case of
> a passing a single pointer as data. Do you think this interface
> may be usedful:
> 
> deferqueue_add_ptr(head, ptr, func, dtor)
> {
> 	return deferqueue(head, &ptr, sizeof(ptr), func, dtor);
> }
> 
> and
> 
> void *deferqueue_data_ptr(void *data)
> {
> 	return *((void **) data);
> }
> 
> [...]

Seems like a great idea to me.

> >>> +	struct ckpt_eventpoll_items *h;
> >>> +	struct eventpoll *ep;
> >>> +	struct file *epfile = NULL;
> >>> +	int ret, i = 0, remaining_watches;
> >>> +
> >>> +	/*
> >>> +	 * TODO possible kmalloc failure due to too many watches.
> >>> +	 */
> >>> +	h = ckpt_read_obj(ctx, 0,
> >>> +			  sizeof(*h) + max_user_watches*sizeof(h->items[0]));
> >> Any reason to perfer ckpt_read_obj() over ckpt_read_buf_type() ?
> >> (it will rid the check for type/len below).
> > 
> > So I could read it all at once rather than fiddle with reading the
> > header and chunking all of the input. I wanted to get something simple
> > working first and then make it work with thousands of watches.
> 
> Hmm.. ckpt_read_buf_type() calls ckpt_read_obj() like you do, then
> checks that the type matches.  So you are left with the h->h.len sanity
> check only.
> 
> [...]

So I looked at how pids are handled again. There's a slight difference
in that I also record the epfile objref and number of "items", so I
found I couldn't use similar code. I'm still working on this part of the
patch but I've already incorporated your suggestions for the next post.

> >>> +	if (!is_file_epoll(epfile))
> >>> +		goto out;
> >>> +
> >>> +	ep = epfile->private_data;
> >>> +
> >>> +	ret = -ENOSPC;
> >>> +	remaining_watches = (max_user_watches -
> >>> +			     atomic_read(&ep->user->epoll_watches));
> >>> +	if (h->num_items > remaining_watches)
> >>> +		goto out;
> >> This is tested anyway by ep_insert() below. Besides, it's racy here
> >> (user may  have another restart in parallel, and the check-modify isn't
> >> atomic. Unlike with ep_insert(), this check is for a bunch of fd's not
> >> a single one.
> > 
> > Yes, it's racy. However, we're restarting so we're primarily adding watches.
> > It's also possible we're restarting thousands of watches only to fail
> > when we add the last few. This avoids all that useless work when possible.
> 
> Is this concern worth taking the code out of its "usual" place (the
> common epoll code) and making a special case here ?

I don't think it's worth that.

Incidentally, the next version will factor out and re-use half of epoll_ctl().
The main benefit will be maintenance since it saves only about 5 lines.

> >>
> >> struct ckpt_hdr_eventpoll_item {
> > 	     ^^^^^
> > I strongly dislike this name change idea. There's no header for each item and 
> > it's obviously part of the checkpoint image. If anything, too many
> > things have "_hdr_" in their names when they lack a ckpt_hdr or don't
> > need one: ckpt_hdr_const, ckpt_hdr_pids, ckpt_hdr_socket, ckpt_hdr_sigset,
> > ckpt_hdr_sigaction, ckpt_hdr_siginfo, ckpt_hdr_rlimit, ckpt_hdr_ipc_perms*)
> 
> I have no strong opinion in either way, and what you say makes sense.
> I do strongly prefer a single common convention to all "ckpt_hdr"-less
> structures (and one to all "ckpt_hdr"-full ones). When the patches for
> this show up, I'll pull them :)

OK, I think this might deserve a code comment in the patches:

/*
 * All struct ckpt_hdr_* structs have a struct ckpt_hdr (usually as
 * their first member).
 */

> 
> > 
> > (*We shouldn't need two adjacent headers for ckpt_hdr_ipc_{shm,msg,sem}.)
> 
> Right. I'll remove the ckpt-hdr from ckpt_{hdr_}ipc_perms.

Great, thanks.

Cheers,
	-Matt Helsley


More information about the Containers mailing list