[RFC v3][PATCH 5/9] Memory managemnet (restore)

Dave Hansen dave at linux.vnet.ibm.com
Wed Sep 10 14:42:02 PDT 2008

On Tue, 2008-09-09 at 02:01 -0400, Oren Laadan wrote: 
> > Have you looked at mprotect_fixup()?  It deals with two things:
> > 1. altering the commit charge against RSS if the mapping is actually
> >    writable.
> > 2. Merging the VMA with an adjacent one if possible
> > 
> > We don't want to do either of these two things.  Even if we do merge the
> > VMA, it will be a waste of time and energy since we'll just re-split it
> > when we mprotect() again.
> Your observation is correct; I chose this interface because it's really
> simple and handy. I'm not worried about the performance because such VMAs
> (read only but modified) are really rare, and the code can be optimized
> later on.

The worry is that it will never get cleaned up, and it is basically
cruft as it stands.  People may think that it is here protecting or
fixing something that it is not.

> >>>> +	/* restore original protection for this vma */
> >>>> +	if (!(cr_vma->vm_flags & VM_WRITE))
> >>>> +		ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
> >>>> +
> >>>> + out:
> >>>> +	return ret;
> >>>> +}
> >>> Ugh.  Is this a security hole?  What if the user was not allowed to
> >>> write to the file being mmap()'d by this VMA?  Is this a window where
> >>> someone could come in and (using ptrace or something similar) write to
> >>> the file?
> >> Not a security hole: this is only for private memory, so it never
> >> modifies the underlying file. This is related to what I explained before
> >> about read-only VMAs that have modified pages.
> > 
> > OK, so a shared, read-only mmap() should never get into this code path.
> > What if an attacker modified the checkpoint file to pretend to have
> > pages for a read-only, but shared mmap().  Would this code be tricked?
> VMAs of shared maps (IPC, anonymous shared) will be treated differently.
> VMAs of shared files (mapped shared) are saved without their contents,
> as the contents remains available on the file system !  (yes, for that
> we will eventually need file system snapshots).
> As for an attack that provides an altered checkpoint image: since we
> (currently) don't escalate privileges, the attacker will not be able
> to modify something that it doesn't have access to in the first place.

I bugged Serge about this.  He said that this, at least, bypasses the SE
Linux checks that are normally done with an mprotect() system call.
That's a larger design problem that we need to keep in mind: we need to
be careful to keep existing checks in place.

> >> The process is restarting, inside a container that is restarting. All
> >> tasks inside should be calling sys_restart() (by design) and no other
> >> process from outside should be allowed to ptrace them at this point.
> > 
> > Are there plans to implement this, or is it already in here somehow?
> Once we get positive responses about the current patchset, the next
> step is to handle multiple processes: I plan to extend the freezer
> with two more state for this purpose (dumping, restarting).

OK, but I just asked you why a ptrace() of a process during this
elevated privilege operation couldn't potentially do something bad.  You
responded that, by design, we can't ptrace things.  The design is all
well and good, but the patch isn't, because it doesn't implement that
design. :(  Before we get these merged, that needs to get resolved.

> >> (In any case, if some other tasks ptraces this task, it can make it do
> >> anything anyhow).
> > 
> > No.  I'm suggesting that since this lets you effectively write to
> > something that is not writable, it may be a hole with which to bypass
> > permissions which were set up at an earlier time.
> That's a good comment, but here all we are doing here is to modify a
> privately mapped/anonymous memory.
> > 
> >>> We copy into the process address space all the time when not in its
> >>> context explicitly.  
> >> Huh ?
> > 
> > I'm just saying that you don't need to be in a process's context in
> > order to copy contents into its virtual address space.  Check out
> > access_process_vm().
> > 
> That would be the other way to implement the restart. But, since restart
> executes in task's context, it's simpler and more efficient to leverage
> copy-to-user().
> In terms of security, both methods brings about the same end results: the
> memory is modified (perhaps bypassing the read-only property of the VMA)

But copy_to_user() is fundamentally different.  It writes *over*
contents and in to files.  Simulating a fault fills in those pages, but
it never writes over things or in to files.   Faulting is fundamentally

Faulting today can also handle populating a memory area with pages that
appear read-only via userspace.  That's exactly what we're doing here as

Anyway, I don't expect that you'll agree with this.  I'll prototype
doing it the other way at some point and we can compare how both look.

-- Dave

More information about the Containers mailing list