From stgraber at ubuntu.com  Mon Sep  4 22:28:57 2017
From: stgraber at ubuntu.com (=?iso-8859-1?Q?St=E9phane?= Graber)
Date: Mon, 4 Sep 2017 18:28:57 -0400
Subject: Linux Plumbers containers micro-conference CFP
In-Reply-To: <20170727182929.t5k665eceewup2xs@castiana>
References: <20170705193033.puyhniz7rvoo572f@castiana>
	<20170727182929.t5k665eceewup2xs@castiana>
Message-ID: <20170904222857.uzafhjdqxxli2e5k@castiana>

On Thu, Jul 27, 2017 at 02:29:29PM -0400, Stéphane Graber wrote:
> On Wed, Jul 05, 2017 at 03:30:34PM -0400, Stéphane Graber wrote:
> > Hey there,
> > 
> > Linux Plumbers 2017 will be held in Los Angeles, CA between the 13th and
> > 15th of September 2017 including the usual containers micro-conference.
> > 
> > This is a great place to catch up with fellow maintainers and users and
> > to discuss issues that affect us all.
> > 
> > You can find the more detailed CFP here:
> >   https://discuss.linuxcontainers.org/t/containers-micro-conference-at-linux-plumbers-2017/262
> > 
> > CFP closes on the 4th of August 2017.
> > 
> > Looking forward to seeing you there!
> 
> This is a reminder that we're still looking for more submissions for the
> containers micro-conference at Linux Plumbers this fall in Los Angeles.
> 
> We're looking for short talks/demos as well as discussion topics for our
> audience of kernel developers, container runtime maintainers and
> container users!
> 
> 
> Proposals can be submitted here: https://linuxplumbersconf.org/2017/ocw/events/LPC2017/proposals/new
> 
> See you in Los Angeles!
> 
> Stéphane
> 
> PS: Forwarding to container projects mailing-lists would be appreciated!

Hey there,

We have now published the schedule for next week's micro-conference:

 https://discuss.linuxcontainers.org/t/containers-micro-conference-schedule/490

See you in Los Angeles!

Stéphane
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 801 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/containers/attachments/20170904/7d2b4f75/attachment-0001.sig>

From stgraber at ubuntu.com  Wed Sep  6 05:37:38 2017
From: stgraber at ubuntu.com (=?iso-8859-1?Q?St=E9phane?= Graber)
Date: Wed, 6 Sep 2017 01:37:38 -0400
Subject: LXC 2.1 has been released
Message-ID: <20170906053738.j3gsfcmlzbipkgbv@castiana>

Hey there,

After 1.5 years of development, we've finally tagged a new feature
release of LXC.

LXC 2.1 is a normal feature release coming with a year of upstream
support. For production environments, you should stick to LXC 2.0 which
benefits from much longer support.


This new release of LXC introduces a few new security features and
various improvements to the LXC tools and templates.

But more importantly, it's a transitional release ahead of LXC 3.0 to be
released early next year. LXC 3.0 will deprecate a number of tools and
change a large number of the existing configuration keys.

LXC 2.1 will issue warnings whenever the user is using something which
will be removed or renamed in the upcoming LXC 3.0.
An lxc-update-config tool is also provided to automatically convert your
containers' configurations to the new format.


More details about LXC 2.1 can be found in the release announcement:

  https://discuss.linuxcontainers.org/t/lxc-2-1-has-been-released/487

-- 
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 801 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/containers/attachments/20170906/87729434/attachment.sig>

From serge at hallyn.com  Wed Sep  6 14:03:42 2017
From: serge at hallyn.com (Serge E. Hallyn)
Date: Wed, 6 Sep 2017 09:03:42 -0500
Subject: [PATCH 2/9] Implement containers as kernel objects
In-Reply-To: <20170818080300.GQ7187@madcap2.tricolour.ca>
References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk>
	<149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk>
	<20170814054711.GB29957@madcap2.tricolour.ca>
	<CAHC9VhRgPRa7KeMt8G700aeFvqVYc0gMx__82K31TYY6oQQqTw@mail.gmail.com>
	<20170818080300.GQ7187@madcap2.tricolour.ca>
Message-ID: <20170906140341.GA8729@mail.hallyn.com>

Quoting Richard Guy Briggs (rgb at redhat.com):
...
> > I believe we are going to need a container ID to container definition
> > (namespace, etc.) mapping mechanism regardless of if the container ID
> > is provided by userspace or a kernel generated serial number.  This
> > mapping should be recorded in the audit log when the container ID is
> > created/defined.
> 
> Agreed.
> 
> > > As was suggested in one of the previous threads, if there are any events not
> > > associated with a task (incoming network packets) we log the namespace ID and
> > > then only concern ourselves with its container serial number or container name
> > > once it becomes associated with a task at which point that tracking will be
> > > more important anyways.
> > 
> > Agreed.  After all, a single namespace can be shared between multiple
> > containers.  For those security officers who need to track individual
> > events like this they will have the container ID mapping information
> > in the logs as well so they should be able to trace the unassociated
> > event to a set of containers.
> > 
> > > I'm not convinced that a userspace or kernel generated UUID is that useful
> > > since they are large, not human readable and may not be globally unique given
> > > the "pets vs cattle" direction we are going with potentially identical
> > > conditions in hosts or containers spawning containers, but I see no need to
> > > restrict them.
> > 
> > From a kernel perspective I think an int should suffice; after all,
> > you can't have more containers than you have processes.  If the
> > container engine requires something more complex, it can use the int
> > as input to its own mapping function.
> 
> PIDs roll over.  That already causes some ambiguity in reporting.  If a
> system is constantly spawning and reaping containers, especially
> single-process containers, I don't want to have to worry about that ID
> rolling to keep track of it even though there should be audit records of
> the spawn and death of each container.  There isn't significant cost
> added here compared with some of the other overhead we're dealing with.

Strawman proposal:

1. Each clone/unshare/setns involving a namespace type generates an audit
message along the lines of:

PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET
new auditnsid: 00000002
associated namespaces: (list of all namespace filesystem inode numbers)

2. Userspace (i.e. the container logging daemon here) can watch the audit log
for all messages relating to auditnsid 00000002.  Presumably there will be
messages along the lines of "PID 9513 in auditnsid 00000002 cloned...".  The
container logging daemon can track those messages and add the new auditnsids
to the list it watches.

3. If a container is migrated (checkpointed and restored here or elsewhere),
userspace can just follow the appropriate logs for the new containers.

Userspace does not ever *request* an auditnsid.  They are ephemeral, just a
tool to track the namespaces through the audit log.  They are however guaranteed
to never be re-used until reboot.
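
A minimal sketch of the userspace side of this strawman, assuming the
message layout shown in step 1.  No such audit record exists today, so
the regular expressions, the follow_container helper and the sample log
lines are all illustrative assumptions.  The point is only that a daemon
can follow a container by watching for "new auditnsid" lines whose
parent it already tracks.

```python
import re

# Hypothetical record layout modelled on the strawman message above;
# no kernel emits these lines.
EVENT_RE = re.compile(r"PID (\d+) .*?in auditnsid ([0-9a-f]{8})\b")
NEW_RE = re.compile(r"new auditnsid: ([0-9a-f]{8})")


def follow_container(lines, root_id):
    """Return every auditnsid descended from root_id, root included."""
    watched = {root_id}
    parent = None  # auditnsid named by the most recent event line
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            parent = event.group(2)
            continue
        new = NEW_RE.search(line)
        # Only adopt the new ID if it was created from one we already watch.
        if new and parent in watched:
            watched.add(new.group(1))
    return watched


log = [
    "PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET",
    "new auditnsid: 00000002",
    "PID 9513 in auditnsid 00000002 cloned CLONE_NEWPID",
    "new auditnsid: 00000003",
    "PID 700 in auditnsid 00000001 cloned CLONE_NEWUTS",
    "new auditnsid: 00000004",
]
print(sorted(follow_container(log, "00000002")))
```

Because the IDs are never reused before reboot, the watched set can only
grow, and the daemon never has to guess whether an ID it sees belongs to
an older container.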

(Feels like someone must have proposed this before)

-serge

From paul at paul-moore.com  Fri Sep  8 20:02:25 2017
From: paul at paul-moore.com (Paul Moore)
Date: Fri, 8 Sep 2017 16:02:25 -0400
Subject: [PATCH 2/9] Implement containers as kernel objects
In-Reply-To: <20170818080300.GQ7187@madcap2.tricolour.ca>
References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk>
	<149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk>
	<20170814054711.GB29957@madcap2.tricolour.ca>
	<CAHC9VhRgPRa7KeMt8G700aeFvqVYc0gMx__82K31TYY6oQQqTw@mail.gmail.com>
	<20170818080300.GQ7187@madcap2.tricolour.ca>
Message-ID: <CAHC9VhQaWSpde6e2M5moERwK-hqff0UH-Z8r7upkGJWqcSXMow@mail.gmail.com>

On Fri, Aug 18, 2017 at 4:03 AM, Richard Guy Briggs <rgb at redhat.com> wrote:
> On 2017-08-16 18:21, Paul Moore wrote:
>> On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <rgb at redhat.com> wrote:
>> > Hi David,
>> >
>> > I wanted to respond to this thread to attempt some constructive feedback,
>> > better late than never.  I had a look at your fsopen/fsmount() patchset(s) to
>> > support this patchset which was interesting, but doesn't directly affect my
>> > work.  The primary patch of interest to the audit kernel folks (Paul Moore and
>> > me) is this patch while the rest of the patchset is interesting, but not likely
>> > to directly affect us.  This patch has most of what we need to solve our
>> > problem.
>> >
>> > Paul and I agree that audit is going to have a difficult time identifying
>> > containers or even namespaces without some change to the kernel.  The audit
>> > subsystem in the kernel needs at least a basic clue about which container
>> > caused an event to be able to report this at the appropriate level and ignore
>> > it at other levels to avoid a DoS.
>>
>> While there is some increased risk of "death by audit", this is really
>> only an issue once we start supporting multiple audit daemons; simply
>> associating auditable events with the container that triggered them
>> shouldn't add any additional overhead (I hope).  For a number of use
>> cases, a single auditd running outside the containers, but recording
>> all their events with some type of container attribution will be
>> sufficient.  This is step #1.
>>
>> However, we will obviously want to go a bit further and support
>> multiple audit daemons on the system to allow containers to
>> record/process their own events (side note: the non-container auditd
>> instance will still see all the events).  There are a number of ways
>> we could tackle this, both via in-kernel and in-userspace record
>> routing, each with their own pros/cons.  However, how this works is
>> going to be dependent on how we identify containers and track their
>> audit events: the bits from step #1.  For this reason I'm not really
>> interested in worrying about the multiple auditd problem just yet;
>> it's obviously important, and something to keep in mind while working
>> up a solution, but it isn't something we should focus on right now.
>>
>> > We also agree that there will need to be some sort of trigger from userspace to
>> > indicate the creation of a container and its allocated resources and we're not
>> > really picky how that is done, such as a clone flag, a syscall or a sysfs write
>> > (or even a read, I suppose), but there will need to be some permission
>> > restrictions, obviously.  (I'd like to see capabilities used for this by adding
>> > a specific container bit to the capabilities bitmask.)
>>
>> To be clear, from an audit perspective I think the only thing we would
>> really care about controlling access to is the creation and assignment
>> of a new audit container ID/token, not necessarily the container
>> itself.  It's a small point, but an important one I think.
>>
>> > I doubt we will be able to accommodate all definitions or concepts of a
>> > container in a timely fashion.  We'll need to start somewhere with a minimum
>> > definition so that we can get traction and actually move forward before another
>> > compelling shared kernel microservice method leaves our entire community
>> > behind.  I'd like to declare that a container is a full set of cloned
>> > namespaces, but this is inefficient, overly constricting and unnecessary for
>> > our needs.  If we could agree on a minimum definition of a container (which may
>> > have only one specific cloned namespace) then we have something on which to
>> > build.  I could even see a container being defined by a trigger sent from
>> > userspace about a process (task) from which all its children are considered to
>> > be within that container, subject to further nesting.
>>
>> I really would prefer if we could avoid defining the term "container".
>> Even if we manage to get it right at this particular moment, we will
>> surely be made fools a year or two from now when things change.  At
>> the very least let's avoid a rigid definition of container; I'll
>> concede that we will probably need to have some definition simply so
>> we can implement something, I just don't want the design or
>> implementation to depend on a particular definition.
>>
>> This comment is jumping ahead a bit, but from an audit perspective I
>> think we handle this by emitting an audit record whenever a container
>> ID is created which describes it as the kernel sees it; as of now that
>> probably means a list of namespace IDs.  Richard mentions this in his
>> email, I just wanted to make it clear that I think we should see this
>> as a flexible mechanism.  At the very least we will likely see a few
>> more namespaces before the world moves on from containers.
>>
>> > In the simplest usable model for audit, if a container (definition implies and)
>> > starts a PID namespace, then the container ID could simply be the container's
>> > "init" process PID in the initial PID namespace.  This assumes that as soon as
>> > that process vanishes, that entire container and all its children are killed
>> > off (which you've done).  There may be some container orchestration systems
>> > that don't use a unique PID namespace per container and that imposing this will
>> > cause them challenges.
>>
>> I don't follow how this would cause challenges if the containers do
>> not use a unique PID namespace; you are suggesting using the PID from
>> the context of the initial PID namespace, yes?
>
> The PID of the "init" process of a container (PID=1 inside container,
> but PID=containerID from the initial PID namespace perspective).

Yep.  I still don't see how a container not creating a unique PID
namespace presents a challenge here as the unique information would be
taken from the initial PID namespace.

However, based on some off-list discussions I expect this is going to
be a non-issue in the next proposal.

>> Regardless, I do worry that using a PID could potentially be a bit
>> racy once we start jumping between kernel and userspace (audit
>> configuration, logs, etc.).
>
> How do you think this could be racy?  An event happening before or as
> the container has been defined?

It's racy for the same reasons why we have the pid struct in the
kernel.  If the orchestrator is referencing things via a PID there is
always some danger of a mixup.

>> > If containers have at minimum a unique mount namespace then the root path
>> > dentry inode device and inode number could be used, but there are likely better
>> > identifiers.  Again, there may be container orchestrators that don't use a
>> > unique mount namespace per container and that imposing this will cause
>> > challenges.
>> >
>> > I expect there are similar examples for each of the other namespaces.
>>
>> The PID case is a bit unique as each process is going to have a unique
>> PID regardless of namespaces, but even that has some drawbacks as
>> discussed above.  As for the other namespaces, I agree that we can't
>> rely on them (see my earlier comments).
>
> (In general can you specify which earlier comments so we can be sure to
> what you are referring?)

Really?  How about the race condition concerns.  Come on Richard ...

>> > If we could pick one namespace type for consensus for which each container has
>> > a unique instance of that namespace, we could use the dev/ino tuple from that
>> > namespace as had originally been suggested by Aristeu Rozanski more than 4
>> > years ago as part of the set of namespace IDs.  I had also attempted to
>> > solve this problem by using the namespace's proc inode, then switched over to
>> > generate a unique kernel serial number for each namespace and then went back to
>> > namespace proc dev/ino once Al Viro implemented nsfs:
>> >         v1      https://lkml.org/lkml/2014/4/22/662
>> >         v2      https://lkml.org/lkml/2014/5/9/637
>> >         v3      https://lkml.org/lkml/2014/5/20/287
>> >         v4      https://lkml.org/lkml/2014/8/20/844
>> >         v5      https://lkml.org/lkml/2014/10/6/25
>> >         v6      https://lkml.org/lkml/2015/4/17/48
>> >         v7      https://lkml.org/lkml/2015/5/12/773
>> >
>> > These patches don't use a container ID, but track all namespaces in use for an
>> > event.  This has the benefit of punting this tracking to userspace for some
>> > other tool to analyse and determine to which container an event belongs.
>> > This will use a lot of bandwidth in audit log files when a single
>> > container ID that doesn't require nesting information to be complete
>> > would be a much more efficient use of audit log bandwidth.
>>
>> Relying on a particular namespace to identify a container is a
>> non-starter from my perspective for all the reasons previously
>> discussed.
>
> I'd rather not either and suspect there isn't much danger of it, but if
> it is determined that there is one namespace in particular that is a
> minimum requirement, I'd prefer to use that nsID instead of creating an
> additional ID.
>
>> > If we rely only on the setting of arbitrary container names from userspace,
>> > then we must provide a map or tree back to the initial audit domain for that
>> > running kernel to be able to differentiate between potentially identical
>> > container names assigned in a nested container system.  If we assign a
>> > container serial number sequentially (atomic64_inc) from the kernel on request
>> > from userspace like the sessionID and log the creation with all nsIDs and the
>> > parent container serial number and/or container name, the nesting is clear due
>> > to lack of ambiguity in potential duplicate names in nesting.  If a container
>> > serial number is used, the tree of inheritance of nested containers can be
>> > rebuilt from the audit records showing what containers were spawned from what
>> > parent.
>>
>> I believe we are going to need a container ID to container definition
>> (namespace, etc.) mapping mechanism regardless of if the container ID
>> is provided by userspace or a kernel generated serial number.  This
>> mapping should be recorded in the audit log when the container ID is
>> created/defined.
>
> Agreed.
>
>> > As was suggested in one of the previous threads, if there are any events not
>> > associated with a task (incoming network packets) we log the namespace ID and
>> > then only concern ourselves with its container serial number or container name
>> > once it becomes associated with a task at which point that tracking will be
>> > more important anyways.
>>
>> Agreed.  After all, a single namespace can be shared between multiple
>> containers.  For those security officers who need to track individual
>> events like this they will have the container ID mapping information
>> in the logs as well so they should be able to trace the unassociated
>> event to a set of containers.
>>
>> > I'm not convinced that a userspace or kernel generated UUID is that useful
>> > since they are large, not human readable and may not be globally unique given
>> > the "pets vs cattle" direction we are going with potentially identical
>> > conditions in hosts or containers spawning containers, but I see no need to
>> > restrict them.
>>
>> From a kernel perspective I think an int should suffice; after all,
>> you can't have more containers than you have processes.  If the
>> container engine requires something more complex, it can use the int
>> as input to its own mapping function.
>
> PIDs roll over.  That already causes some ambiguity in reporting.  If a
> system is constantly spawning and reaping containers, especially
> single-process containers, I don't want to have to worry about that ID
> rolling to keep track of it even though there should be audit records of
> the spawn and death of each container.  There isn't significant cost
> added here compared with some of the other overhead we're dealing with.

Fine, make it a u64.  I believe that's what I've been proposing in the
off-list discussion if memory serves.

A UUID or string are not acceptable from my perspective.  Too big for
the audit records and not really necessary anyway, a u64 should be
just fine.

... and if anyone dares bring up that 640kb quote I swear I'll NACK
all their patches for the next year :)
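
For a rough sense of scale on the rollover concern, the arithmetic can
be sketched as follows.  The creation rate of one million IDs per second
is an arbitrary assumption chosen to be pessimistic, and 2**22 is the
upper bound for kernel.pid_max on 64-bit systems; neither number comes
from this thread.

```python
PID_MAX_LIMIT = 2 ** 22            # upper bound for kernel.pid_max on 64-bit Linux
U64_IDS = 2 ** 64                  # number of distinct u64 values
RATE = 1_000_000                   # assumed: new IDs handed out per second
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_to_wrap(id_space, per_second):
    """Seconds until a sequential counter over id_space wraps around."""
    return id_space / per_second


pid_wrap = seconds_to_wrap(PID_MAX_LIMIT, RATE)
u64_wrap_years = seconds_to_wrap(U64_IDS, RATE) / SECONDS_PER_YEAR
print(f"PID-sized space wraps after {pid_wrap:.1f} seconds")
print(f"u64 space wraps after {u64_wrap_years:,.0f} years")
```

At that assumed rate a PID-sized space wraps within seconds, while a
sequential u64 lasts for hundreds of thousands of years, which is why
reuse stops being a practical concern.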

>> > How do we deal with setns()?  Once it is determined that action is permitted,
>> > given the new combination of namespaces and potential membership in a different
>> > container, record the transition from one container to another including all
>> > namespaces if the latter are a different subset than the target container
>> > initial set.
>>
>> That is a fun one, isn't it?  I think this is where the container
>> ID-to-definition mapping comes into play.  If setns() changes the
>> process such that the existing container ID is no longer valid then we
>> need to do a new lookup in the table to see if another container ID is
>> valid; if no established container ID mappings are valid, the
>> container ID becomes "undefined".
>
> Hopefully we can design this stuff so that container IDs are still valid
> while that transition occurs.
>
>> paul moore
>
> - RGB
>
> --
> Richard Guy Briggs <rgb at redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

-- 
paul moore
www.paul-moore.com

From ebiederm at xmission.com  Mon Sep 11 17:21:54 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Mon, 11 Sep 2017 12:21:54 -0500
Subject: [GIT PULL] namespace updates for 4.14-rc1
Message-ID: <87mv61cfrh.fsf@xmission.com>


Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 076a9bcacfc7ccbc2b3fdf3bd490718f6b182419 signal/mips: Remove FPE_FIXME usage from mips

Life has been busy and I have not gotten half as much done this round as
I would have liked.  I delayed it so that a minor conflict resolution
with the mips tree could spend a little time in linux-next before I sent
this pull request.

This pull request includes two long delayed user namespace changes from
Kirill Tkhai.  It also includes a very useful change from Serge Hallyn
that allows the security capability attribute to be used inside of user
namespaces.  The practical effect of this is people can now untar
tarballs and install rpms in user namespaces.  It had been suggested to
generalize this and encode some of the namespace information
in the xattr name.  Upon close inspection that makes the things that
should be hard easy and the things that should be easy more expensive.
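
As a rough illustration of what the v3 format stores, the sketch below
decodes a v3 security.capability value.  The layout follows struct
vfs_ns_cap_data in include/uapi/linux/capability.h (magic_etc, two
permitted/inheritable pairs, then rootid, all little endian).  The
helper name parse_v3_caps and the sample bytes are made up for this
example; on a real system the raw value would come from something like
os.getxattr(path, "security.capability").

```python
import struct

# Constants as defined in include/uapi/linux/capability.h.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001

# magic_etc, 2 x (permitted, inheritable), rootid: six little-endian u32s.
V3_LAYOUT = struct.Struct("<6I")


def parse_v3_caps(raw):
    """Decode a v3 security.capability xattr value into a dict."""
    if len(raw) != V3_LAYOUT.size:
        raise ValueError("not a v3 capability xattr (wrong size)")
    magic_etc, perm_lo, inh_lo, perm_hi, inh_hi, rootid = V3_LAYOUT.unpack(raw)
    if magic_etc & VFS_CAP_REVISION_MASK != VFS_CAP_REVISION_3:
        raise ValueError("not a v3 capability xattr (wrong revision)")
    return {
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": perm_lo | (perm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
        "rootid": rootid,  # uid of the owning user namespace's root
    }


# Synthetic sample: CAP_NET_RAW (bit 13) permitted and effective,
# owned by a user namespace whose root maps to uid 100000.
sample = V3_LAYOUT.pack(
    VFS_CAP_REVISION_3 | VFS_CAP_FLAGS_EFFECTIVE, 1 << 13, 0, 0, 0, 100000
)
print(parse_v3_caps(sample))
```

The only addition over the v2 layout is the trailing rootid field, which
ties the capabilities to one user namespace rather than encoding
namespace information in the xattr name.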

Then there is my bugfix/cleanup for signal injection that removes
the magic encoding of the siginfo union member from the kernel internal
si_code.  The mips folks reported that the case where I had used FPE_FIXME
is impossible, so I have removed FPE_FIXME from mips, while at the same
time including a return statement in that case to keep gcc from
complaining about uninitialized variables.

I almost finished the work to make copy_siginfo_to_user a trivial
copy to user.  The code is available at:
   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3
But I did not have time/energy to get the code posted and reviewed before
the merge window opened.

I was able to see that the security excuse for just copying fields that
we know are initialized doesn't work in practice: there are buggy
initializations that don't initialize the proper fields in siginfo.  So
we still sometimes copy uninitialized data to userspace.

Eric W. Biederman (11):
      signal/alpha: Document a conflict with SI_USER for SIGTRAP
      signal/ia64: Document a conflict with SI_USER with SIGFPE
      signal/sparc: Document a conflict with SI_USER with SIGFPE
      signal/mips: Document a conflict with SI_USER with SIGFPE
      signal/testing: Don't look for __SI_FAULT in userspace
      userns,pidns: Verify the userns for new pid namespaces
      fcntl: Don't use ambiguous SIG_POLL si_codes
      signal: Remove kernel interal si_code magic
      signal: Fix sending signals with siginfo
      mips/signal: In force_fcr31_sig return in the impossible case
      signal/mips: Remove FPE_FIXME usage from mips

Kirill Tkhai (2):
      security: Use user_namespace::level to avoid redundant iterations in cap_capable()
      prctl: Allow local CAP_SYS_ADMIN changing exe_file

Serge E. Hallyn (1):
      Introduce v3 namespaced file capabilities


 arch/alpha/include/uapi/asm/siginfo.h         |  14 ++
 arch/alpha/kernel/traps.c                     |   6 +-
 arch/arm64/kernel/signal32.c                  |  23 +--
 arch/blackfin/include/uapi/asm/siginfo.h      |  30 ++-
 arch/frv/include/uapi/asm/siginfo.h           |   2 +-
 arch/ia64/include/uapi/asm/siginfo.h          |  21 +-
 arch/ia64/kernel/signal.c                     |  17 +-
 arch/ia64/kernel/traps.c                      |   4 +-
 arch/mips/include/uapi/asm/siginfo.h          |   4 +-
 arch/mips/kernel/signal32.c                   |  19 +-
 arch/mips/kernel/traps.c                      |   2 +-
 arch/parisc/kernel/signal32.c                 |  31 ++-
 arch/powerpc/kernel/signal_32.c               |  20 +-
 arch/s390/kernel/compat_signal.c              |  32 ++-
 arch/sparc/include/uapi/asm/siginfo.h         |   9 +-
 arch/sparc/kernel/signal32.c                  |  16 +-
 arch/sparc/kernel/traps_32.c                  |   2 +-
 arch/sparc/kernel/traps_64.c                  |   2 +-
 arch/tile/include/uapi/asm/siginfo.h          |   4 +-
 arch/tile/kernel/compat_signal.c              |  18 +-
 arch/tile/kernel/traps.c                      |   2 +-
 arch/x86/kernel/signal_compat.c               |  21 +-
 fs/fcntl.c                                    |  13 +-
 fs/signalfd.c                                 |  22 +-
 fs/xattr.c                                    |   6 +
 include/linux/capability.h                    |   2 +
 include/linux/security.h                      |   2 +
 include/linux/signal.h                        |  22 ++
 include/linux/user_namespace.h                |   9 +-
 include/uapi/asm-generic/siginfo.h            | 115 +++++------
 include/uapi/linux/capability.h               |  22 +-
 kernel/exit.c                                 |   4 +-
 kernel/pid_namespace.c                        |   4 +
 kernel/ptrace.c                               |   6 +-
 kernel/signal.c                               |  72 +++++--
 kernel/sys.c                                  |   8 +-
 kernel/user_namespace.c                       |  20 +-
 security/commoncap.c                          | 277 ++++++++++++++++++++++++--
 tools/testing/selftests/x86/mpx-mini-test.c   |   3 +-
 tools/testing/selftests/x86/protection_keys.c |  13 +-
 40 files changed, 622 insertions(+), 297 deletions(-)

Eric


From rgb at redhat.com  Wed Sep 13 17:13:28 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Wed, 13 Sep 2017 13:13:28 -0400
Subject: RFC: Audit Kernel Container IDs
Message-ID: <20170913171328.GP3405@madcap2.tricolour.ca>

Containers are a userspace concept.  The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions.  Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
trigger signal from the userspace container orchestration system
initiates this.  This will define a point in time and a set of resources
associated with a particular container with an audit container ID.

The trigger is a pseudo filesystem (proc, since PID tree already exists)
write of a u64 representing the container ID to a file representing a
process that will become the first process in a new container.
This might place restrictions on mount namespaces required to define a
container, or at least careful checking of namespaces in the kernel to
verify permissions of the orchestrator so it can't change its own
container ID.
A bind mount of nsfs may be necessary in the container orchestrator's
mntNS.
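
A sketch of what the orchestrator side of that trigger might look like.
The RFC does not name the proc file, so the file name "containerid" and
the helper register_container are purely hypothetical, and the demo
writes to a scratch file rather than to /proc.

```python
import os
import tempfile

U64_MAX = 2 ** 64 - 1


def register_container(id_file, container_id):
    """Write a u64 container ID, as decimal text, to the given file.

    In the proposal id_file would be a per-process proc entry for the
    task that becomes the container's first process; the name of that
    entry is not specified, so callers pass the path explicitly.
    """
    if not 0 <= container_id <= U64_MAX:
        raise ValueError("container ID must fit in a u64")
    with open(id_file, "w") as f:
        f.write(f"{container_id}\n")


# Demo against a scratch file standing in for the proc entry.
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "containerid")
    register_container(path, 123456)
    with open(path) as f:
        print(f.read().strip())
```

Range-checking in userspace is only a convenience here; the kernel side
would still have to validate the value and the writer's permissions.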

Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
filesystem to have this action permitted.  At that time, record the
child container's user-supplied 64-bit container identifier along with
the child container's first process (which may become the container's
"init" process) process ID (referenced from the initial PID namespace),
all namespace IDs (in the form of a nsfs device number and inode number
tuple) in a new auxiliary record AUDIT_CONTAINER with a qualifying
op=$action field.

Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's container ID,
referenced in the process' audit_context struct.

Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable.  Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
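
The namespace IDs referred to here are already visible from userspace:
each entry under /proc/<pid>/ns resolves to an nsfs inode, and stat()
on it yields the device and inode number tuple.  A small sketch, with
namespace_ids as an illustrative helper name, that returns an empty
mapping where /proc is unavailable:

```python
import os


def namespace_ids(pid="self"):
    """Map namespace type -> (device number, inode number) for a process."""
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return ids  # no such process, or no procfs on this system
    for name in names:
        try:
            # stat() follows the magic symlink to the nsfs inode.
            st = os.stat(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable for this process
        ids[name] = (st.st_dev, st.st_ino)
    return ids


for name, (dev, ino) in namespace_ids().items():
    print(f"{name}: dev={dev:#x} ino={ino}")
```

Two processes share a namespace exactly when the tuples for that
namespace type match, which is the comparison a log analysis tool would
make on the recorded IDs.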

Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

A process can be moved from one container to another by using the
container assignment method outlined above a second time.

When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount has dropped to zero, log the fact.
(This latter is likely needed for certification accountability.)  A
container object may need a list of processes and/or namespaces.
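
The inheritance and end-of-container rules can be modelled in a few
lines.  This is a toy userspace model, not kernel code: it tracks only
processes (not namespaces), and the class name, method names and the
op=assign / op=end log strings are invented for illustration.

```python
class ContainerTable:
    """Toy model: container IDs follow processes and end with the last one."""

    def __init__(self):
        self.container_of = {}  # pid -> container ID
        self.members = {}       # container ID -> set of member pids
        self.log = []           # stand-in for emitted audit records

    def assign(self, pid, cid):
        """Model the proc write: put pid into container cid (moving it if needed)."""
        self._leave(pid)
        self.container_of[pid] = cid
        self.members.setdefault(cid, set()).add(pid)
        self.log.append(f"op=assign pid={pid} contid={cid}")

    def fork(self, parent, child):
        """A forked or cloned process inherits its parent's container ID."""
        cid = self.container_of.get(parent)
        if cid is not None:
            self.container_of[child] = cid
            self.members[cid].add(child)

    def exit(self, pid):
        self._leave(pid)

    def _leave(self, pid):
        cid = self.container_of.pop(pid, None)
        if cid is None:
            return
        self.members[cid].discard(pid)
        if not self.members[cid]:
            # Last member gone: the container ceases to exist, so log it.
            del self.members[cid]
            self.log.append(f"op=end contid={cid}")


table = ContainerTable()
table.assign(100, 7)
table.fork(100, 101)
table.exit(100)
table.exit(101)
print(table.log)
```

A real implementation would also have to hold the container alive while
any namespace still references it, which is why the text above suggests
the container object may need a list of namespaces as well.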

A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container.  A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.

Feedback please.

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From carlos at redhat.com  Wed Sep 13 19:33:52 2017
From: carlos at redhat.com (Carlos O'Donell)
Date: Wed, 13 Sep 2017 14:33:52 -0500
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <20170913171328.GP3405@madcap2.tricolour.ca>
References: <20170913171328.GP3405@madcap2.tricolour.ca>
Message-ID: <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com>

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> Containers are a userspace concept.  The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view and the unshare syscall and the semantics that
arise from the use of it.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
containers if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?

> The Linux audit system needs a way to be able to track the container
> provenance of events and actions.  Audit needs the kernel's help to do
> this.

* Why does the Linux audit system need to track container provenance?

  - How does it help to provide better audit messages?

  - Would it be enough to list the namespace that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.

> Since the concept of a container is entirely a userspace concept, a
> trigger signal from the userspace container orchestration system
> initiates this.  This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.

Please don't use the word 'signal'; I suggest 'register' since you are
writing to a filesystem.

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.
> A bind mount of nsfs may be necessary in the container orchestrator's
> mntNS.
> 
> Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> filesystem to have this action permitted.  At that time, record the
> child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
your continued operation as we have today?

> the child container's first process (which may become the container's
> "init" process) process ID (referenced from the initial PID namespace),
> all namespace IDs (in the form of a nsfs device number and inode number
> tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> op=$action field.

What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration process,
registering all tid/pids in the same grouping? Is that true?

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
> 
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the same container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you foresee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable.  Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> 
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.
> 
> A process can be moved from one container to another by using the
> container assignment method outlined above a second time.

OK.

> When a container ceases to exist because the last process in that
> container has exited and hence the last namespace has been destroyed and
> its refcount dropping to zero, log the fact.
> (This latter is likely needed for certification accountability.)  A
> container object may need a list of processes and/or namespaces.

OK.

> A namespace cannot directly migrate from one container to another but
> could be assigned to a newly spawned container.  A namespace can be
> moved from one container to another indirectly by having that namespace
> used in a second process in another container and then ending all the
> processes in the first container.

OK.

> Feedback please.

-- 
Cheers,
Carlos.

From rgb at redhat.com  Thu Sep 14 05:30:08 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Thu, 14 Sep 2017 01:30:08 -0400
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com>
References: <20170913171328.GP3405@madcap2.tricolour.ca>
	<9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com>
Message-ID: <20170914053007.GR3405@madcap2.tricolour.ca>

On 2017-09-13 14:33, Carlos O'Donell wrote:
> On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> > Containers are a userspace concept.  The kernel knows nothing of them.
> 
> I am looking at this RFC from a userspace perspective, particularly from
> the loader's point of view and the unshare syscall and the semantics that
> arise from the use of it.
> 
> At a high level what you are doing is providing a way to group, without
> hierarchy, processes and namespaces. The processes can move between
> container's if they have CAP_CONTAINER_ADMIN and can open and write to
> a special proc file.
> 
> * With unshare a thread may dissociate part of its execution context and
>   therefore see a distinct mount namespace. When you say "process" in this
>   particular RFC do you exclude the fact that a thread might be in a
>   distinct container from the rest of the threads in the process?
> 
> > The Linux audit system needs a way to be able to track the container
> > provenance of events and actions.  Audit needs the kernel's help to do
> > this.
> 
> * Why does the Linux audit system need to tracker container provenance?

- ability to filter unwanted, irrelevant or unimportant messages before
  they fill the queue so important messages don't get lost.  This is a
  certification requirement.

- ability to make security claims about containers, which requires
  tracking of actions within those containers to ensure compliance with
  established security policies.

- ability to route messages from events to relevant audit daemon
  instance or host audit daemon instance or both, as required or
  determined by user-initiated rules

>   - How does it help to provide better audit messages?
> 
>   - Is it be enough to list the namespace that a process occupies?

We started with that approach more than 4 years ago and found it
helped, but didn't go far enough in terms of quick and inexpensive
record filtering and left some doubt about provenance of events in the
case of non-user context events (incoming network packets).

> * Why does it need the kernel's help?
> 
>   - Is there a race condition that is only fixable with kernel support?

This was a concern, but relatively minor compared with the other benefits.

>   - Or is it easier with kernel help but not required?

It is much easier and much less expensive.

> Providing background on these questions would help clarify the
> design requirements.

Here are some references that should help provide some background:
	https://github.com/linux-audit/audit-kernel/issues/32
	RFE: add namespace IDs to audit records

	https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events
	SPEC Virtualization Manager Guest Lifecycle Events

	https://lwn.net/Articles/699819/
	Audit, namespaces, and containers

	https://lwn.net/Articles/723561/
	Containers as kernel objects
	(my reply, with references: https://lkml.org/lkml/2017/8/14/15 )

	https://bugzilla.redhat.com/show_bug.cgi?id=1045666
	audit: add namespace IDs to log records

> > Since the concept of a container is entirely a userspace concept, a
> > trigger signal from the userspace container orchestration system
> > initiates this.  This will define a point in time and a set of resources
> > associated with a particular container with an audit container ID.
> 
> Please don't use the word 'signal', I suggest 'register' since you are
> writing to a filesystem.

Ok, that's a very reasonable request.  'signal' has a previous meaning.

> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
> > write of a u64 representing the container ID to a file representing a
> > process that will become the first process in a new container.
> > This might place restrictions on mount namespaces required to define a
> > container, or at least careful checking of namespaces in the kernel to
> > verify permissions of the orchestrator so it can't change its own
> > container ID.
> > A bind mount of nsfs may be necessary in the container orchestrator's
> > mntNS.
> > 
> > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> > filesystem to have this action permitted.  At that time, record the
> > child container's user-supplied 64-bit container identifier along with
> 
> What is a "child container?" Containers don't have any hierarchy.

Maybe some don't, but that's not likely to last long given the
abstraction and nesting of orchestration tools.  This must be nestable.

> I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
> your continued operation as we have today?

Correct.  It won't prevent processes that otherwise have permissions
today from creating all the namespaces they wish.

> > the child container's first process (which may become the container's
> > "init" process) process ID (referenced from the initial PID namespace),
> > all namespace IDs (in the form of a nsfs device number and inode number
> > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> > op=$action field.
> 
> What kind of requirement is there on the first tid/pid registering
> the container ID? What if the 8th tid/pid does the registration?
> Would that mean that the first process of the container did not
> register? It seems like you are suggesting that the registration
> by the 8th tid/pid causes a cascading registration progress,
> registering all tid/pids in the same grouping? Is that true?

Ah, good question, I forgot to address that fact.  The intent is that
either threaded processes after initiating threading will not have
permission to execute this, or all the processes in the thread group
will be forced into the same container.  I don't have a strong opinion
on whether or not it must be the lead thread process that must be the
one to receive that registration, but I suspect that would be wise.
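
To make the second option concrete (forcing the whole thread group into
the same container), here is a minimal Python sketch.  It is a model
only: the function name and the two dictionaries are hypothetical and
say nothing about how the kernel would store this.

```python
def register(contid_of, thread_groups, tid, contid):
    """Apply contid to every thread in the group that contains tid.

    contid_of:     dict mapping tid -> container ID (updated in place)
    thread_groups: dict mapping tgid -> set of tids in that group
    Returns the tgid of the group that was registered.
    """
    for tgid, tids in thread_groups.items():
        if tid in tids:
            # Registration through any thread cascades to the whole group.
            for t in tids:
                contid_of[t] = contid
            return tgid
    raise KeyError(tid)
```

With this behaviour it does not matter whether the 8th thread or the
lead thread is the target of the registration; the result is the same.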

> > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> > container ID present on an auditable action or event.
> > 
> > Forked and cloned processes inherit their parent's container ID,
> > referenced in the process' audit_context struct.
> 
> So a cloned process with CLONE_NEWNS has the came container ID
> as the parent process that called clone, at least until the clone
> has time to change to a new container ID?

Yes.

> Do you forsee any case where someone might need a semantic that is
> slightly different? For example wanting to set the container ID on
> clone?

I could envision that situation and I think that might be workable but
for the synchronicity of having one initiated by a specific syscall and
the other initiated by a /proc write.

> > Log the creation of every namespace, inheriting/adding its spawning
> > process' containerID(s), if applicable.  Include the spawning and
> > spawned namespace IDs (device and inode number tuples).
> > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> > Note: At this point it appears only network namespaces may need to track
> > container IDs apart from processes since incoming packets may cause an
> > auditable event before being associated with a process.
> 
> OK.
> 
> > Log the destruction of every namespace when it is no longer used by any
> > process, include the namespace IDs (device and inode number tuples).
> > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> > 
> > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> > the parent and child namespace IDs for any changes to a process'
> > namespaces. [setns(2)]
> > Note: It may be possible to combine AUDIT_NS_* record formats and
> > distinguish them with an op=$action field depending on the fields
> > required for each message type.
> > 
> > A process can be moved from one container to another by using the
> > container assignment method outlined above a second time.
> 
> OK.
> 
> > When a container ceases to exist because the last process in that
> > container has exited and hence the last namespace has been destroyed and
> > its refcount dropping to zero, log the fact.
> > (This latter is likely needed for certification accountability.)  A
> > container object may need a list of processes and/or namespaces.
> 
> OK.
> 
> > A namespace cannot directly migrate from one container to another but
> > could be assigned to a newly spawned container.  A namespace can be
> > moved from one container to another indirectly by having that namespace
> > used in a second process in another container and then ending all the
> > processes in the first container.
> 
> OK.
> 
> > Feedback please.

Thank you sir!

> Carlos.

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From rgb at redhat.com  Thu Sep 14 05:47:45 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Thu, 14 Sep 2017 01:47:45 -0400
Subject: [PATCH 2/9] Implement containers as kernel objects
In-Reply-To: <20170906140341.GA8729@mail.hallyn.com>
References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk>
	<149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk>
	<20170814054711.GB29957@madcap2.tricolour.ca>
	<CAHC9VhRgPRa7KeMt8G700aeFvqVYc0gMx__82K31TYY6oQQqTw@mail.gmail.com>
	<20170818080300.GQ7187@madcap2.tricolour.ca>
	<20170906140341.GA8729@mail.hallyn.com>
Message-ID: <20170914054745.GS3405@madcap2.tricolour.ca>

On 2017-09-06 09:03, Serge E. Hallyn wrote:
> Quoting Richard Guy Briggs (rgb at redhat.com):
> ...
> > > I believe we are going to need a container ID to container definition
> > > (namespace, etc.) mapping mechanism regardless of if the container ID
> > > is provided by userspace or a kernel generated serial number.  This
> > > mapping should be recorded in the audit log when the container ID is
> > > created/defined.
> > 
> > Agreed.
> > 
> > > > As was suggested in one of the previous threads, if there are any events not
> > > > associated with a task (incoming network packets) we log the namespace ID and
> > > > then only concern ourselves with its container serial number or container name
> > > > once it becomes associated with a task at which point that tracking will be
> > > > more important anyways.
> > > 
> > > Agreed.  After all, a single namespace can be shared between multiple
> > > containers.  For those security officers who need to track individual
> > > events like this they will have the container ID mapping information
> > > in the logs as well so they should be able to trace the unassociated
> > > event to a set of containers.
> > > 
> > > > I'm not convinced that a userspace or kernel generated UUID is that useful
> > > > since they are large, not human readable and may not be globally unique given
> > > > the "pets vs cattle" direction we are going with potentially identical
> > > > conditions in hosts or containers spawning containers, but I see no need to
> > > > restrict them.
> > > 
> > > From a kernel perspective I think an int should suffice; after all,
> > > you can't have more containers then you have processes.  If the
> > > container engine requires something more complex, it can use the int
> > > as input to its own mapping function.
> > 
> > PIDs roll over.  That already causes some ambiguity in reporting.  If a
> > system is constantly spawning and reaping containers, especially
> > single-process containers, I don't want to have to worry about that ID
> > rolling to keep track of it even though there should be audit records of
> > the spawn and death of each container.  There isn't significant cost
> > added here compared with some of the other overhead we're dealing with.
> 
> Strawman proposal:
> 
> 1. Each clone/unshare/setns involving a namespace type generates an audit
> message along the lines of:
> 
> PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET
> new auditnsid: 00000002
> associated namespaces: (list of all namespace filesystem inode numbers)

As you will have seen, this is pretty much what my most recent proposal suggests.

> 2. Userspace (i.e. the container logging deamon here) can watch the audit log
> for all messages relating to auditnsid 00000002.  Presumably there will be
> messages along the lines of "PID 9513 in auditnsid 00000002 cloned...".  The
> container logging daemon can track those messages and add the new auditnsids
> to the list it watches.

Yes.

> 3. If a container is migrated (checkpointed and restored here or elsewhere),
> userspace can just follow the appropriate logs for the new containers.

Yes.
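
For point 2, the tracking a userspace logging daemon would do can be
sketched in a few lines of Python.  The two-line message layout is taken
from the strawman text above and is not a real audit record format; the
regular expressions and the function name are assumptions made for this
illustration.

```python
import re

# Patterns follow the strawman wording, not an existing audit format.
CLONE_RE = re.compile(r"PID (\d+) .*in auditnsid (\w+) cloned")
NEW_RE = re.compile(r"new auditnsid: (\w+)")

def follow(lines, root):
    """Return the set of auditnsids descended from root, root included."""
    watched = {root}
    pending = None  # auditnsid of the most recent clone message
    for line in lines:
        m = CLONE_RE.search(line)
        if m:
            pending = m.group(2)
            continue
        m = NEW_RE.search(line)
        if m and pending is not None:
            # Only follow namespaces created from one we already watch.
            if pending in watched:
                watched.add(m.group(1))
            pending = None
    return watched
```

Starting from auditnsid 00000002, the daemon would pick up any auditnsid
created by a process inside it, and in turn anything created from those.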

> Userspace does not ever *request* a auditnsid.  They are ephemeral, just a
> tool to track the namespaces through the audit log.  They are however guaranteed
> to never be re-used until reboot.

Well, this is where things get controversial...  I had wanted this, a
kernel-generated serial number unique to a running kernel to track every
container initiation, but this does have some CRIU challenges pointed
out by Eric Biederman.  Nested containers will not have a consistent
view on a new host and no way to make it consistent.  If we could
guarantee that containers would never be nested, this could be workable.
I think nesting is inevitable in the future given the variety and
creativity of the orchestration tools, so restricting this seems
short-sighted.

At the moment the approach is to let the orchestrator determine the ID of
a container.  Some have argued for as small as u32 and others for a full
UUID.  A u32 runs the risk of rolling, so a u64 seems like a reasonable
step to solve that issue.  Others would like to be able to store a full
UUID, which seemed like a good idea at the outset, but on further
thinking, this is something the orchestrator can manage while minimising
the number of bits of required information per audit record to guarantee
we can identify the provenance of a particular audit event.  Let's see
if we can make it work with a u64.
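
To put rough numbers on the rollover concern, the Python sketch below
estimates how long an ID space of a given width lasts if IDs are handed
out sequentially at a fixed rate.  The rate of 1000 new containers per
second is an arbitrary assumption for illustration, and nothing here
requires an orchestrator to allocate IDs sequentially.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def seconds_until_rollover(bits, ids_per_second):
    # Time to exhaust a 2**bits ID space at a constant allocation rate.
    return (2 ** bits) / ids_per_second

def days_until_rollover(bits, ids_per_second):
    return seconds_until_rollover(bits, ids_per_second) / SECONDS_PER_DAY

def years_until_rollover(bits, ids_per_second):
    return seconds_until_rollover(bits, ids_per_second) / SECONDS_PER_YEAR

RATE = 1000  # assumed containers spawned per second
u32_days = days_until_rollover(32, RATE)
u64_years = years_until_rollover(64, RATE)
```

At that assumed rate a u32 would wrap in under two months, while a u64
would last hundreds of millions of years.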

> (Feels like someone must have proposed this before)

These ideas have been thrown around a few times and I'm starting to
understand them better.

> -serge

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From ebiederm at xmission.com  Thu Sep 14 17:33:06 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Thu, 14 Sep 2017 12:33:06 -0500
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <20170913171328.GP3405@madcap2.tricolour.ca> (Richard Guy
	Briggs's message of "Wed, 13 Sep 2017 13:13:28 -0400")
References: <20170913171328.GP3405@madcap2.tricolour.ca>
Message-ID: <87d16tb2y5.fsf@xmission.com>

Richard Guy Briggs <rgb at redhat.com> writes:

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.

Why a u64?

Why a proc filesystem write and not a magic audit message?
I don't like the fact that the proc filesystem entry is likely going to
be readable and abusable by non-audit contexts?

Why the ability to change the containerid?  What is the use case you are
thinking of there?

Eric

From rgb at redhat.com  Thu Sep 14 18:07:04 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Thu, 14 Sep 2017 14:07:04 -0400
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <87d16tb2y5.fsf@xmission.com>
References: <20170913171328.GP3405@madcap2.tricolour.ca>
	<87d16tb2y5.fsf@xmission.com>
Message-ID: <20170914180704.GU3405@madcap2.tricolour.ca>

On 2017-09-14 12:33, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb at redhat.com> writes:
> 
> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
> > write of a u64 representing the container ID to a file representing a
> > process that will become the first process in a new container.
> > This might place restrictions on mount namespaces required to define a
> > container, or at least careful checking of namespaces in the kernel to
> > verify permissions of the orchestrator so it can't change its own
> > container ID.
> 
> Why a u64?

u32 will roll too quickly.  UUID is large enough that it adds
significantly to audit record bandwidth.  I'd prefer u64, but can look
at the difference of accommodating a UUID...
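
As a rough sketch of that difference, the Python snippet below compares
the text size of one key=value field carrying a worst-case decimal u64
against the same field carrying a UUID in its canonical form.  The field
name "contid" is a placeholder for this illustration, not a settled
name, and the real per-record cost depends on how many records carry
the field.

```python
import uuid

U64_MAX = 2 ** 64 - 1

def field_len(key, value):
    # Length of one "key=value" text field plus a separating space.
    return len(f"{key}={value} ")

u64_field = field_len("contid", U64_MAX)            # 20 decimal digits
uuid_field = field_len("contid", uuid.UUID(int=0))  # 36-character form
extra_per_record = uuid_field - u64_field
```

So a UUID costs 16 more bytes per field than the largest decimal u64,
and more than that against the small values a sequential allocator
would hand out first.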

> Why a proc filesystem write and not a magic audit message?

A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use
sparingly.  Given that orchestrators will already require it to send
the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden.

I was originally leaning towards an audit message trigger or a syscall.

> I don't like the fact that the proc filesystem entry is likely going to
> be readable and abusable by non-audit contexts?

This proposal wasn't going to start with that link being readable, but
its filesystem structure and link names would be, perhaps giving away
too much already.

I think we will need to find a way for the orchestrator or one of its
authorized agents to read this information while blocking reads from
unauthorized agents, otherwise this would be of very limited use.

> Why the ability to change the containerid?  What is the use case you are
> thinking of there?

This was covered in the end of the conversation with Paul Moore (that
maybe you got tired reading?)  I'd originally proposed having it write
once, but Paul figured there was no good reason to restrict it and leave
that decision up to the orchestrator.  The use case would be adding
other processes to a container, but it could be argued all additional
processes should be spawned by the first process in a container.

> Eric

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From rgb at redhat.com  Fri Sep 15 10:19:11 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Fri, 15 Sep 2017 06:19:11 -0400
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <20170914053007.GR3405@madcap2.tricolour.ca>
References: <20170913171328.GP3405@madcap2.tricolour.ca>
	<9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com>
	<20170914053007.GR3405@madcap2.tricolour.ca>
Message-ID: <20170915101911.GA21172@madcap2.tricolour.ca>

On 2017-09-14 01:30, Richard Guy Briggs wrote:
> On 2017-09-13 14:33, Carlos O'Donell wrote:
> > On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> > > Containers are a userspace concept.  The kernel knows nothing of them.
> > 
> > I am looking at this RFC from a userspace perspective, particularly from
> > the loader's point of view and the unshare syscall and the semantics that
> > arise from the use of it.
> > 
> > At a high level what you are doing is providing a way to group, without
> > hierarchy, processes and namespaces. The processes can move between
> > container's if they have CAP_CONTAINER_ADMIN and can open and write to
> > a special proc file.

I should clarify: It wasn't intended that a process can see or modify
its own or a peer's special proc container file to be able to set it or
discover its value.  This was only meant for its orchestrator or
delegated agents to do.  This can't be left only to CAP_CONTAINER_ADMIN.
This may require a container to have its own mount namespace if the 
trigger mechanism is a proc file write.  Other methods (additional
namespaces?) may be needed to restrict it for other trigger methods
(syscall?).

> > * With unshare a thread may dissociate part of its execution context and
> >   therefore see a distinct mount namespace. When you say "process" in this
> >   particular RFC do you exclude the fact that a thread might be in a
> >   distinct container from the rest of the threads in the process?
> > 
> > > The Linux audit system needs a way to be able to track the container
> > > provenance of events and actions.  Audit needs the kernel's help to do
> > > this.
> > 
> > * Why does the Linux audit system need to tracker container provenance?
> 
> - ability to filter unwanted, irrelevant or unimportant messages before
>   they fill queue so important messages don't get lost.  This is a
>   certification requirement.
> 
> - ability to make security claims about containers, require tracking of
>   actions within those containers to ensure compliance with established
>   security policies.
> 
> - ability to route messages from events to relevant audit daemon
>   instance or host audit daemon instance or both, as required or
>   determined by user-initiated rules
> 
> >   - How does it help to provide better audit messages?
> > 
> >   - Is it be enough to list the namespace that a process occupies?
> 
> We started with that approach back more than 4 years ago and found it
> helped, but didn't go far enough in terms of quick and inexpensive
> record filtering and left some doubt about provenance of events in the
> case of non-user context events (incoming network packets).
> 
> > * Why does it need the kernel's help?
> > 
> >   - Is there a race condition that is only fixable with kernel support?
> 
> This was a concern, but relatively minor compared with the other benefits.
> 
> >   - Or is it easier with kernel help but not required?
> 
> It is much easier and much less expensive.
> 
> > Providing background on these questions would help clarify the
> > design requirements.
> 
> Here are some references that should help provide some background:
> 	https://github.com/linux-audit/audit-kernel/issues/32
> 	RFE: add namespace IDs to audit records
> 
> 	https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events
> 	SPEC Virtualization Manager Guest Lifecycle Events
> 
> 	https://lwn.net/Articles/699819/
> 	Audit, namespaces, and containers
> 
> 	https://lwn.net/Articles/723561/
> 	Containers as kernel objects
> 	(my reply, with references: https://lkml.org/lkml/2017/8/14/15 )
> 
> 	https://bugzilla.redhat.com/show_bug.cgi?id=1045666
> 	audit: add namespace IDs to log records
> 
> > > Since the concept of a container is entirely a userspace concept, a
> > > trigger signal from the userspace container orchestration system
> > > initiates this.  This will define a point in time and a set of resources
> > > associated with a particular container with an audit container ID.
> > 
> > Please don't use the word 'signal', I suggest 'register' since you are
> > writing to a filesystem.
> 
> Ok, that's a very reasonable request.  'signal' has a previous meaning.
> 
> > > The trigger is a pseudo filesystem (proc, since PID tree already exists)
> > > write of a u64 representing the container ID to a file representing a
> > > process that will become the first process in a new container.
> > > This might place restrictions on mount namespaces required to define a
> > > container, or at least careful checking of namespaces in the kernel to
> > > verify permissions of the orchestrator so it can't change its own
> > > container ID.
> > > A bind mount of nsfs may be necessary in the container orchestrator's
> > > mntNS.
> > > 
> > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> > > filesystem to have this action permitted.  At that time, record the
> > > child container's user-supplied 64-bit container identifier along with
> > 
> > What is a "child container?" Containers don't have any hierarchy.
> 
> Maybe some don't, but that's not likely to last long given the
> abstraction and nesting of orchestration tools.  This must be nestable.

This is why we can't rely only on CAP_CONTAINER_ADMIN to restrict the
ability for self-modification or self-discovery.

> > I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
> > your continued operation as we have today?
> 
> Correct.  It won't prevent processes that otherwise have permissions
> today from creating all the namespaces it wishes.
> 
> > > the child container's first process (which may become the container's
> > > "init" process) process ID (referenced from the initial PID namespace),
> > > all namespace IDs (in the form of a nsfs device number and inode number
> > > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> > > op=$action field.
> > 
> > What kind of requirement is there on the first tid/pid registering
> > the container ID? What if the 8th tid/pid does the registration?
> > Would that mean that the first process of the container did not
> > register? It seems like you are suggesting that the registration
> > by the 8th tid/pid causes a cascading registration progress,
> > registering all tid/pids in the same grouping? Is that true?
> 
> Ah, good question, I forgot to address that fact.  The intent is that
> either threaded processes after initiating threading will not have
> permission to execute this, or all the processes in the thread group
> will be forced into the same container.  I don't have a strong opinion
> on whether or not it must be the lead thread process that must be the
> one to receive that registration, but I suspect that would be wise.
> 
> > > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> > > container ID present on an auditable action or event.
> > > 
> > > Forked and cloned processes inherit their parent's container ID,
> > > referenced in the process' audit_context struct.
> > 
> > So a cloned process with CLONE_NEWNS has the same container ID
> > as the parent process that called clone, at least until the clone
> > has time to change to a new container ID?
> 
> Yes.

And as pointed out above, it isn't the process itself that is able to
change to a new container; it is its orchestrator that moves/assigns it.

> > Do you foresee any case where someone might need a semantic that is
> > slightly different? For example wanting to set the container ID on
> > clone?
> 
> I could envision that situation and I think that might be workable but
> for the synchronicity of having one initiated by a specific syscall and
> the other initiated by a /proc write.

The ability to clone while providing a containerID would work really
well, but I'm hesitant to extend or duplicate the clone call.  This
actually sounds like a potentially sane way of approaching it.

> > > Log the creation of every namespace, inheriting/adding its spawning
> > > process' containerID(s), if applicable.  Include the spawning and
> > > spawned namespace IDs (device and inode number tuples).
> > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> > > Note: At this point it appears only network namespaces may need to track
> > > container IDs apart from processes since incoming packets may cause an
> > > auditable event before being associated with a process.
> > 
> > OK.
> > 
> > > Log the destruction of every namespace when it is no longer used by any
> > > process, include the namespace IDs (device and inode number tuples).
> > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> > > 
> > > Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> > > the parent and child namespace IDs for any changes to a process'
> > > namespaces. [setns(2)]
> > > Note: It may be possible to combine AUDIT_NS_* record formats and
> > > distinguish them with an op=$action field depending on the fields
> > > required for each message type.
> > > 
> > > A process can be moved from one container to another by using the
> > > container assignment method outlined above a second time.
> > 
> > OK.
> > 
> > > When a container ceases to exist because the last process in that
> > > container has exited, and hence the last namespace has been destroyed
> > > and its refcount has dropped to zero, log the fact.
> > > (This latter is likely needed for certification accountability.)  A
> > > container object may need a list of processes and/or namespaces.
> > 
> > OK.
> > 
> > > A namespace cannot directly migrate from one container to another but
> > > could be assigned to a newly spawned container.  A namespace can be
> > > moved from one container to another indirectly by having that namespace
> > > used in a second process in another container and then ending all the
> > > processes in the first container.
> > 
> > OK.
> > 
> > > Feedback please.
> 
> Thank you sir!
> 
> > Carlos.
> 
> - RGB
> 
> --
> Richard Guy Briggs <rgb at redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
> 
> --
> Linux-audit mailing list
> Linux-audit at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From ebiederm at xmission.com  Tue Sep 19 02:45:19 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Mon, 18 Sep 2017 21:45:19 -0500
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <20170914180704.GU3405@madcap2.tricolour.ca> (Richard Guy
	Briggs's message of "Thu, 14 Sep 2017 14:07:04 -0400")
References: <20170913171328.GP3405@madcap2.tricolour.ca>
	<87d16tb2y5.fsf@xmission.com>
	<20170914180704.GU3405@madcap2.tricolour.ca>
Message-ID: <87wp4v76f4.fsf@xmission.com>

Richard Guy Briggs <rgb at redhat.com> writes:

> On 2017-09-14 12:33, Eric W. Biederman wrote:
>> Richard Guy Briggs <rgb at redhat.com> writes:
>> 
>> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
>> > write of a u64 representing the container ID to a file representing a
>> > process that will become the first process in a new container.
>> > This might place restrictions on mount namespaces required to define a
>> > container, or at least careful checking of namespaces in the kernel to
>> > verify permissions of the orchestrator so it can't change its own
>> > container ID.
>> 
>> Why a u64?
>
> u32 will roll too quickly.  UUID is large enough that it adds
> significantly to audit record bandwidth.  I'd prefer u64, but can look
> at the difference of accommodating a UUID...

I was imagining a string might be better, since for the purposes of audit
it is just a byte string you regurgitate.
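
To make the proc-write trigger quoted above concrete, here is a rough
userspace sketch of what an orchestrator might do.  Everything specific in
it is an assumption: the proposal does not say whether the u64 is written
as text or binary (decimal text is assumed here), and no file name has
been settled, so the caller supplies the path.

```c
#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Render a u64 container ID as decimal text; returns the string length. */
static int format_container_id(uint64_t id, char *buf, size_t len)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%" PRIu64, id);
}

/*
 * Write the ID to an existing file.  In the proposal this would be a
 * per-process file under /proc; its name is not defined yet, so path is
 * purely illustrative.  Returns 0 on success, -1 on any failure.
 */
static int write_container_id(const char *path, uint64_t id)
{
	char buf[32];	/* a u64 needs at most 20 decimal digits */
	int n = format_container_id(id, buf, sizeof(buf));
	ssize_t w;
	int fd;

	if (n < 0 || (size_t)n >= sizeof(buf))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	w = write(fd, buf, (size_t)n);
	close(fd);
	return w == n ? 0 : -1;
}
```

Either way the kernel side only sees a short byte string, which is the
point made above: audit would just echo it back in its records.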

>> Why a proc filesystem write and not a magic audit message?
>
> A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use
> sparingly.  Given that orchestrators will already require it to send
> the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden.
>
> I was originally leaning towards an audit message trigger or a syscall.
>
>> I don't like the fact that the proc filesystem entry is likely going to
>> be readable and abusable by non-audit contexts?
>
> This proposal wasn't going to start with that link being readable, but
> its filesystem structure and link names would be, perhaps giving away
> too much already.
>
> I think we will need to find a way for the orchestrator or one of its
> authorized agents to read this information while blocking reads from
> unauthorized agents, otherwise this would be of very limited use.

Something that is set only for future audit messages seems reasonable.
Once you start reading this from something other than audit messages I
get nervous that people will use this beyond audit for things it is
not intended for.

>> Why the ability to change the containerid?  What is the use case you are
>> thinking of there?
>
> This was covered in the end of the conversation with Paul Moore (that
> maybe you got tired reading?)

I have not had time to review everything, as I was busy preparing for my
wedding and am now in the middle of my honeymoon.

> I'd originally proposed having it write
> once, but Paul figured there was no good reason to restrict it and leave
> that decision up to the orchestrator.  The use case would be adding
> other processes to a container, but it could be argued all additional
> processes should be spawned by the first process in a container.

I see two cases here:
a) Nested containers
b) Inject processes via something like nsenter into a container.

In case a) you have to figure out what to do with nested containers
and that does seem to be a legitimate case for a double write.  Arguably
with the restriction that you must specify a more nested label.

In case b), which you seem to be referring to, it would be a process
created by the container manager outside the container that has no
container label.  At that point there is no need for a double write.

So my recommendation is to not support double writes until you support
nested containers.

Eric

From rgb at redhat.com  Tue Sep 19 04:15:05 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Tue, 19 Sep 2017 00:15:05 -0400
Subject: RFC: Audit Kernel Container IDs
In-Reply-To: <87wp4v76f4.fsf@xmission.com>
References: <20170913171328.GP3405@madcap2.tricolour.ca>
	<87d16tb2y5.fsf@xmission.com>
	<20170914180704.GU3405@madcap2.tricolour.ca>
	<87wp4v76f4.fsf@xmission.com>
Message-ID: <20170919041505.GQ3405@madcap2.tricolour.ca>

On 2017-09-18 21:45, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb at redhat.com> writes:
> 
> > On 2017-09-14 12:33, Eric W. Biederman wrote:
> >> Richard Guy Briggs <rgb at redhat.com> writes:
> >> 
> >> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
> >> > write of a u64 representing the container ID to a file representing a
> >> > process that will become the first process in a new container.
> >> > This might place restrictions on mount namespaces required to define a
> >> > container, or at least careful checking of namespaces in the kernel to
> >> > verify permissions of the orchestrator so it can't change its own
> >> > container ID.
> >> 
> >> Why a u64?
> >
> > u32 will roll too quickly.  UUID is large enough that it adds
> > significantly to audit record bandwidth.  I'd prefer u64, but can look
> > at the difference of accommodating a UUID...
> 
> I was imagining a string might be better, since for the purposes of audit
> it is just a byte string you regurgitate.

Yes, so looking at u128 vs dhowells' proposal, it would be 16 bytes vs
24 bytes, which really isn't that much difference...

What string length were you envisioning?

> >> Why a proc filesystem write and not a magic audit message?
> >
> > A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use
> > sparingly.  Given that orchestrators will already require it to send
> > the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden.
> >
> > I was originally leaning towards an audit message trigger or a syscall.
> >
> >> I don't like the fact that the proc filesystem entry is likely going to
> >> be readable and abusable by non-audit contexts?
> >
> > This proposal wasn't going to start with that link being readable, but
> > its filesystem structure and link names would be, perhaps giving away
> > too much already.
> >
> > I think we will need to find a way for the orchestrator or one of its
> > authorized agents to read this information while blocking reads from
> > unauthorized agents, otherwise this would be of very limited use.
> 
> Something that is set only for future audit messages seems reasonable.
> Once you start reading this from something other than audit messages I
> get nervous that people will use this beyond audit for things it is
> not intended for.

Understandably.  At the same time, if we implement something that is
more broadly useful and solves a number of other challenges others are
facing, how can we make it available while limiting the potential for
abuse?

> >> Why the ability to change the containerid?  What is the use case you are
> >> thinking of there?
> >
> > This was covered in the end of the conversation with Paul Moore (that
> > maybe you got tired reading?)
> 
> I have not had time to review everything, as I was busy preparing for my
> wedding and am now in the middle of my honeymoon.

I'm very sorry, my bad!  You had given me a heads up about this and I
apologise for causing a stir during your special time.

> > I'd originally proposed having it write
> > once, but Paul figured there was no good reason to restrict it and leave
> > that decision up to the orchestrator.  The use case would be adding
> > other processes to a container, but it could be argued all additional
> > processes should be spawned by the first process in a container.
> 
> I see two cases here:
> a) Nested containers
> b) Inject processes via something like nsenter into a container.
> 
> In case a) you have to figure out what to do with nested containers
> and that does seem to be a legitimate case for a double write.  Arguably
> with the restriction that you must specify a more nested label.

Is this technically a double write if it is an inheritance?  That should
be solvable with a flag.

> In case b), which you seem to be referring to, it would be a process
> created by the container manager outside the container that has no
> container label.  At that point there is no need for a double write.

Looking at the potential for nesting, if the orchestrator is already in
a container, then it would already have a label, but if we refer to the
flag solution above, then it is still the first write.

> So my recommendation is to not support double writes until you support
> nested containers.

I think this is a reasonable restriction.

Thanks for your time.  Sorry to disturb your holiday.

> Eric

- RGB

--
Richard Guy Briggs <rgb at redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

From ebiederm at xmission.com  Thu Sep 28 22:34:53 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Thu, 28 Sep 2017 17:34:53 -0500
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
Message-ID: <87tvzmqwoi.fsf@xmission.com>


It looks like once upon a time a long time ago selinux copied code
from cap_inode_removexattr and cap_inode_setxattr into
selinux_inode_setotherxattr.  However the code has now diverged and
selinux is implementing a policy that is quite different than
cap_inode_setxattr and cap_inode_removexattr especially when it comes
to the security.capability xattr.

To keep things working and to make the comments in security/security.c
correct when the xattr is security.capability, call cap_inode_setxattr
or cap_inode_removexattr as appropriate.

I suspect there is a larger conversation to be had here but this
is enough to keep selinux from implementing a nonsense hard-coded
policy that breaks other parts of the kernel.

Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
---
 security/selinux/hooks.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f5d304736852..edf4bd292dc7 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
 	u32 newsid, sid = current_sid();
 	int rc = 0;
 
+	if (strcmp(name, XATTR_NAME_CAPS) == 0)
+		return cap_inode_setxattr(dentry, name, value, size, flags);
+
 	if (strcmp(name, XATTR_NAME_SELINUX))
 		return selinux_inode_setotherxattr(dentry, name);
 
@@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry)
 
 static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
 {
+	if (strcmp(name, XATTR_NAME_CAPS) == 0)
+		return cap_inode_removexattr(dentry, name);
+
 	if (strcmp(name, XATTR_NAME_SELINUX))
 		return selinux_inode_setotherxattr(dentry, name);
 
-- 
2.14.1


From casey at schaufler-ca.com  Fri Sep 29 01:16:06 2017
From: casey at schaufler-ca.com (Casey Schaufler)
Date: Thu, 28 Sep 2017 18:16:06 -0700
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
In-Reply-To: <87tvzmqwoi.fsf@xmission.com>
References: <87tvzmqwoi.fsf@xmission.com>
Message-ID: <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>

On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
> It looks like once upon a time a long time ago selinux copied code
> from cap_inode_removexattr and cap_inode_setxattr into
> selinux_inode_setotherxattr.  However the code has now diverged and
> selinux is implementing a policy that is quite different than
> cap_inode_setxattr and cap_inode_removexattr especially when it comes
> to the security.capability xattr.

What leads you to believe that this isn't intentional?
It's most likely the case that this change occurred as
part of the first round module stacking change. What behavior
do you see that you're unhappy with?

>
> To keep things working

Which "things"? How are they not "working"?

>  and to make the comments in security/security.c
> correct when the xattr is security.capability, call cap_inode_setxattr
> or cap_inode_removexattr as appropriate.
>
> I suspect there is a larger conversation to be had here but this
> is enough to keep selinux from implementing a nonsense hard-coded
> policy that breaks other parts of the kernel.

Specifics, please. Since I can't guess what problem you've
encountered I can't tell if it's here, in the infrastructure,
or in your perception of what constitutes "broken".

>
> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> ---
>  security/selinux/hooks.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index f5d304736852..edf4bd292dc7 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>  	u32 newsid, sid = current_sid();
>  	int rc = 0;
>  
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> +		return cap_inode_setxattr(dentry, name, value, size, flags);
> +

No. Don't even think of contemplating considering embedding the cap
attribute check in the SELinux code. cap_inode_setxattr() is called in
the infrastructure. 

>  	if (strcmp(name, XATTR_NAME_SELINUX))
>  		return selinux_inode_setotherxattr(dentry, name);
>  
> @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry)
>  
>  static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> +		return cap_inode_removexattr(dentry, name);
> +
>  	if (strcmp(name, XATTR_NAME_SELINUX))
>  		return selinux_inode_setotherxattr(dentry, name);
>  


.

From sds at tycho.nsa.gov  Fri Sep 29 12:36:41 2017
From: sds at tycho.nsa.gov (Stephen Smalley)
Date: Fri, 29 Sep 2017 08:36:41 -0400
Subject: [RFC][PATCH] security: Make the selinux setxattr and
	removexattr hooks behave
In-Reply-To: <87tvzmqwoi.fsf@xmission.com>
References: <87tvzmqwoi.fsf@xmission.com>
Message-ID: <1506688601.5571.1.camel@tycho.nsa.gov>

On Thu, 2017-09-28 at 17:34 -0500, Eric W. Biederman wrote:
> It looks like once upon a time a long time ago selinux copied code
> from cap_inode_removexattr and cap_inode_setxattr into
> selinux_inode_setotherxattr.  However the code has now diverged and
> selinux is implementing a policy that is quite different than
> cap_inode_setxattr and cap_inode_removexattr especially when it comes
> to the security.capability xattr.
> 
> To keep things working and to make the comments in security/security.c
> correct when the xattr is security.capability, call cap_inode_setxattr
> or cap_inode_removexattr as appropriate.
> 
> I suspect there is a larger conversation to be had here but this
> is enough to keep selinux from implementing a nonsense hard-coded
> policy that breaks other parts of the kernel.

Originally SELinux called the cap functions directly since there was no
stacking support in the infrastructure and one had to manually stack a
secondary module internally.  inode_setxattr and inode_removexattr
however were special cases because the cap functions would check
CAP_SYS_ADMIN for any non-capability attributes in the security.*
namespace, and we don't want to impose that requirement on setting
security.selinux.  Thus, we inlined the capabilities logic into the
selinux hook functions and adapted it appropriately.  When the stacking
support was introduced, it had to also special case these hooks so that
only the primary module's hook is used for the same reason; otherwise,
the kernel would end up applying a CAP_SYS_ADMIN check on setting
security.selinux.  Your change below is almost but not quite right
since it only calls the cap functions when setting the capability
attribute; the residual problem is that it will then skip the SELinux
FILE__SETATTR (file setattr) permission check when setting those
attributes, which we want to retain.  So you need to only return early
if cap_inode_setxattr()/removexattr() return an error; otherwise, you
need to proceed to the SELinux check, and you can then delete the
duplicated logic from selinux_inode_setotherxattr().  At which point it
just becomes a call to dentry_has_perm() and you can just inline that
into selinux_inode_setxattr() and selinux_inode_removexattr().
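
The control flow described above can be modelled in userspace as follows.
This is a sketch of one reading of the suggestion, not the eventual patch:
the stub_* functions and their result variables are hypothetical stand-ins
for cap_inode_setxattr() and the dentry_has_perm() FILE__SETATTR check
(the real ones take a dentry and more arguments), and the security.selinux
branch is reduced to a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define XATTR_NAME_CAPS    "security.capability"
#define XATTR_NAME_SELINUX "security.selinux"

/* Test knobs standing in for the real checks (hypothetical). */
static int stub_cap_check_result;	/* result of the capability check */
static int stub_selinux_perm_result;	/* result of the SELinux check */
static int selinux_perm_calls;		/* how often SELinux was consulted */

static int stub_cap_inode_setxattr(const char *name)
{
	(void)name;
	return stub_cap_check_result;
}

static int stub_dentry_has_perm_setattr(void)
{
	selinux_perm_calls++;
	return stub_selinux_perm_result;
}

/*
 * For any attribute other than security.selinux: run the capability
 * check, return early only if it fails, otherwise still apply the
 * SELinux setattr permission check.
 */
static int model_selinux_inode_setxattr(const char *name)
{
	int rc;

	assert(name != NULL);

	if (strcmp(name, XATTR_NAME_SELINUX) != 0) {
		rc = stub_cap_inode_setxattr(name);
		if (rc)
			return rc;
		return stub_dentry_has_perm_setattr();
	}

	/* Placeholder: the security.selinux relabel checks are not modelled. */
	return 0;
}
```

The difference from the posted patch is the second return: a successful
capability check no longer ends the hook, so the SELinux permission check
is kept for security.capability as well.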

> 
> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> ---
>  security/selinux/hooks.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index f5d304736852..edf4bd292dc7 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>  	u32 newsid, sid = current_sid();
>  	int rc = 0;
>  
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> +		return cap_inode_setxattr(dentry, name, value, size, flags);
> +
>  	if (strcmp(name, XATTR_NAME_SELINUX))
>  		return selinux_inode_setotherxattr(dentry, name);
>  
> @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry)
>  
>  static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> +		return cap_inode_removexattr(dentry, name);
> +
>  	if (strcmp(name, XATTR_NAME_SELINUX))
>  		return selinux_inode_setotherxattr(dentry, name);
>  

From sds at tycho.nsa.gov  Fri Sep 29 14:18:57 2017
From: sds at tycho.nsa.gov (Stephen Smalley)
Date: Fri, 29 Sep 2017 10:18:57 -0400
Subject: [RFC][PATCH] security: Make the selinux setxattr and
	removexattr hooks behave
In-Reply-To: <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
References: <87tvzmqwoi.fsf@xmission.com>
	<1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
Message-ID: <1506694737.5571.9.camel@tycho.nsa.gov>

On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote:
> On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
> > It looks like once upon a time a long time ago selinux copied code
> > from cap_inode_removexattr and cap_inode_setxattr into
> > selinux_inode_setotherxattr.  However the code has now diverged and
> > selinux is implementing a policy that is quite different than
> > cap_inode_setxattr and cap_inode_removexattr especially when it
> > comes
> > to the security.capable xattr.
> 
> What leads you to believe that this isn't intentional?
> It's most likely the case that this change occurred as
> part of the first round module stacking change. What behavior
> do you see that you're unhappy with?
> 
> > 
> > To keep things working
> 
> Which "things"? How are they not "working"?
> 
> > and to make the comments in security/security.c
> > correct when the xattr is securit.capable, call cap_inode_setxattr
> > or cap_inode_removexattr as appropriate.
> > 
> > I suspect there is a larger conversation to be had here but this
> > is enough to keep selinux from implementing a non-sense hard coded
> > policy that breaks other parts of the kernel.
> 
> Specifics, please. Since I can't guess what problem you've
> encountered I can't tell if it's here, in the infrastructure,
> or in your perception of what constitutes "broken".
> 
> > 
> > Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> > ---
> >  security/selinux/hooks.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index f5d304736852..edf4bd292dc7 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
> >  	u32 newsid, sid = current_sid();
> >  	int rc = 0;
> >  
> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> > +		return cap_inode_setxattr(dentry, name, value, size, flags);
> > +
> 
> No. Don't even think of contemplating considering embedding the cap
> attribute check in the SELinux code. cap_inode_setxattr() is called
> in
> the infrastructure.

Except that it isn't, not if any other security module is enabled and
implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when
setting security.selinux or security.SMACK*.

An alternative approach to fixing this would be to change the cap
functions to only apply their checks if setting the capability
attribute and defer any checks on other security.* attributes to either
the security framework or the other security modules.  Then the
framework could always call all the modules on the inode_setxattr and
inode_removexattr hooks as with other hooks.  The security framework
would then need to ensure that a check is still applied when setting
security.* attributes if it isn't already handled by one of the enabled
security modules, as you don't want unprivileged userspace to be able
to set arbitrary security.foo attributes or to set up security.selinux
or security.SMACK* attributes if those modules happen to be disabled.

> 
> >  	if (strcmp(name, XATTR_NAME_SELINUX))
> >  		return selinux_inode_setotherxattr(dentry, name);
> >  
> > @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry)
> >  
> >  static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
> >  {
> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> > +		return cap_inode_removexattr(dentry, name);
> > +
> >  	if (strcmp(name, XATTR_NAME_SELINUX))
> >  		return selinux_inode_setotherxattr(dentry, name);
> >  
> 
> 
> .

From casey at schaufler-ca.com  Fri Sep 29 15:46:21 2017
From: casey at schaufler-ca.com (Casey Schaufler)
Date: Fri, 29 Sep 2017 08:46:21 -0700
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
In-Reply-To: <1506694737.5571.9.camel@tycho.nsa.gov>
References: <87tvzmqwoi.fsf@xmission.com>
	<1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
	<1506694737.5571.9.camel@tycho.nsa.gov>
Message-ID: <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com>

On 9/29/2017 7:18 AM, Stephen Smalley wrote:
> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote:
>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
>>> It looks like once upon a time a long time ago selinux copied code
>>> from cap_inode_removexattr and cap_inode_setxattr into
>>> selinux_inode_setotherxattr.  However the code has now diverged and
>>> selinux is implementing a policy that is quite different than
>>> cap_inode_setxattr and cap_inode_removexattr especially when it
>>> comes
>>> to the security.capable xattr.
>> What leads you to believe that this isn't intentional?
>> It's most likely the case that this change occurred as
>> part of the first round module stacking change. What behavior
>> do you see that you're unhappy with?
>>
>>> To keep things working
>> Which "things"? How are they not "working"?
>>
>>> and to make the comments in security/security.c
>>> correct when the xattr is securit.capable, call cap_inode_setxattr
>>> or cap_inode_removexattr as appropriate.
>>>
>>> I suspect there is a larger conversation to be had here but this
>>> is enough to keep selinux from implementing a non-sense hard coded
>>> policy that breaks other parts of the kernel.
>> Specifics, please. Since I can't guess what problem you've
>> encountered I can't tell if it's here, in the infrastructure,
>> or in your perception of what constitutes "broken".
>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>>> ---
>>>  security/selinux/hooks.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>> index f5d304736852..edf4bd292dc7 100644
>>> --- a/security/selinux/hooks.c
>>> +++ b/security/selinux/hooks.c
>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>>>  	u32 newsid, sid = current_sid();
>>>  	int rc = 0;
>>>  
>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>> +		return cap_inode_setxattr(dentry, name, value, size, flags);
>>> +
>> No. Don't even think of contemplating considering embedding the cap
>> attribute check in the SELinux code. cap_inode_setxattr() is called
>> in
>> the infrastructure.
> Except that it isn't, not if any other security module is enabled and
> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when
> setting security.selinux or security.SMACK*.

OK. Yes, this bit of the infrastructure is some of the
worst I've done in a long time. This is a case where we
already need special case stacking infrastructure. It looks
like we'll have to separate setting the cap attribute from
checking the cap state in order to make this work. In any
case, the security_inode_setxattr() code is where the change
belongs. There will likely be fallout changes in the modules,
including the cap module.

> An alternative approach to fixing this would be to change the cap
> functions to only apply their checks if setting the capability
> attribute and defer any checks on other security.* attributes to either
> the security framework or the other security modules.  Then the
> framework could always call all the modules on the inode_setxattr and
> inode_removexattr hooks as with other hooks.  The security framework
> would then need to ensure that a check is still applied when setting
> security.* attributes if it isn't already handled by one of the enabled
> security modules, as you don't want unprivileged userspace to be able
> to set arbitrary security.foo attributes or to set up security.selinux
> or security.SMACK* attributes if those modules happen to be disabled.

Agreed. This isn't a two line change. Grumble.

I can guess at what the problem might be, but I hate making
assumptions when I go to fix a problem. I will start looking
at a patch, but it would really help if I could say for sure
what I'm out to accomplish. It may be obvious to the casual
observer, but that description has not been applied to me very
often.

>
>>
>>>  	if (strcmp(name, XATTR_NAME_SELINUX))
>>>  		return selinux_inode_setotherxattr(dentry, name);
>>>  
>>> @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry)
>>>  
>>>  static int selinux_inode_removexattr(struct dentry *dentry, const char *name)
>>>  {
>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>> +		return cap_inode_removexattr(dentry, name);
>>> +
>>>  	if (strcmp(name, XATTR_NAME_SELINUX))
>>>  		return selinux_inode_setotherxattr(dentry, name);
>>>  
>>
>> .


.

From ebiederm at xmission.com  Sat Sep 30 16:22:55 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 30 Sep 2017 11:22:55 -0500
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
In-Reply-To: <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> (Casey
	Schaufler's message of "Fri, 29 Sep 2017 08:46:21 -0700")
References: <87tvzmqwoi.fsf@xmission.com>
	<1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
	<1506694737.5571.9.camel@tycho.nsa.gov>
	<6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com>
Message-ID: <87vak0ma00.fsf@xmission.com>

Casey Schaufler <casey at schaufler-ca.com> writes:

> On 9/29/2017 7:18 AM, Stephen Smalley wrote:
>> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote:
>>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
>>>> It looks like once upon a time a long time ago selinux copied code
>>>> from cap_inode_removexattr and cap_inode_setxattr into
>>>> selinux_inode_setotherxattr.  However the code has now diverged and
>>>> selinux is implementing a policy that is quite different than
>>>> cap_inode_setxattr and cap_inode_removexattr especially when it
>>>> comes
>>>> to the security.capable xattr.
>>> What leads you to believe that this isn't intentional?
>>> It's most likely the case that this change occurred as
>>> part of the first round module stacking change. What behavior
>>> do you see that you're unhappy with?
>>>
>>>> To keep things working
>>> Which "things"? How are they not "working"?
>>>
>>>> and to make the comments in security/security.c
>>>> correct when the xattr is securit.capable, call cap_inode_setxattr
>>>> or cap_inode_removexattr as appropriate.
>>>>
>>>> I suspect there is a larger conversation to be had here but this
>>>> is enough to keep selinux from implementing a non-sense hard coded
>>>> policy that breaks other parts of the kernel.
>>> Specifics, please. Since I can't guess what problem you've
>>> encountered I can't tell if it's here, in the infrastructure,
>>> or in your perception of what constitutes "broken".
>>>
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>>>> ---
>>>>  security/selinux/hooks.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>> index f5d304736852..edf4bd292dc7 100644
>>>> --- a/security/selinux/hooks.c
>>>> +++ b/security/selinux/hooks.c
>>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>>>>  	u32 newsid, sid = current_sid();
>>>>  	int rc = 0;
>>>>  
>>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>>> +		return cap_inode_setxattr(dentry, name, value, size, flags);
>>>> +
>>> No. Don't even think of contemplating considering embedding the cap
>>> attribute check in the SELinux code. cap_inode_setxattr() is called
>>> in
>>> the infrastructure.
>> Except that it isn't, not if any other security module is enabled and
>> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when
>> setting security.selinux or security.SMACK*.
>
> OK. Yes, this bit of the infrastructure is some of the
> worst I've done in a long time. This is a case where we
> already need special case stacking infrastructure. It looks
> like we'll have to separate setting the cap attribute from
> checking the cap state in order to make this work. In any
> case, the security_inode_setxattr() code is where the change
> belongs. There will likely be fallout changes in the modules,
> including the cap module.
>
>> An alternative approach to fixing this would be to change the cap
>> functions to only apply their checks if setting the capability
>> attribute and defer any checks on other security.* attributes to either
>> the security framework or the other security modules.  Then the
>> framework could always call all the modules on the inode_setxattr and
>> inode_removexattr hooks as with other hooks.  The security framework
>> would then need to ensure that a check is still applied when setting
>> security.* attributes if it isn't already handled by one of the enabled
>> security modules, as you don't want unprivileged userspace to be able
>> to set arbitrary security.foo attributes or to set up security.selinux
>> or security.SMACK* attributes if those modules happen to be disabled.
>
> Agreed. This isn't a two line change. Grumble.
>
> I can guess at what the problem might be, but I hate making
> assumptions when I go to fix a problem. I will start looking
> at a patch, but it would really help if I could say for sure
> what I'm out to accomplish. It may be obvious to the casual
> observer, but that description has not been applied to me very
> often.

Apologies for the delayed reply.

I am looking at security_inode_setxattr.

For setting attributes in security.* the generic code in fs/xattr.c
applies no permission checks.

Each security module that implements an xattr in security.* then imposes
its own policy on its own attribute.

For smack the basic rule is smack_privileged(CAP_MAC_ADMIN).
For selinux the basic rule is inode_owner_or_capable(inode).
For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP).

commoncap also applies a default policy to setting security.* xattrs:
ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).

smack reuses that default policy by calling cap_inode_setxattr if it
isn't a smack security.* xattr.

selinux has what looks like an old copy of the commoncap checks for
security.* in selinux_inode_setotherxattr, testing for
capable(CAP_SETFCAP) for security.capability and capable(CAP_SYS_ADMIN)
for the others.

An added complication is that selinux also calls
selinux_inode_setotherxattr in the removexattr case, so fixing
this in selinux_inode_setotherxattr is not appropriate.

I believe selinux also has general policy hooks it applies to all
invocations of setxattr.

Hmm.  At least ima also has the distinction between protecting
its own xattr writes and running general security module policy on
xattr writes.

So I think to really fix this we need to separate the question "is this
your security module's attribute?" from the general policy checks added
by the security modules.  Perhaps something like this for
security_inode_setxattr:

int security_inode_setxattr(struct dentry *dentry, const char *name,
			    const void *value, size_t size, int flags)
{
	int ret = 0;

	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
		return 0;

	if (strncmp(name, XATTR_SECURITY_PREFIX,
			sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) {
		/*
		 * Call the security modules.  If they all return
		 * -EOPNOTSUPP, apply the default permission check of
		 * ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
		 * Otherwise, if one of the security modules supports
		 * this attribute (signaled by returning something other
		 * than -EOPNOTSUPP), set ret to that result.
		 *
		 * The security modules include at least smack, selinux,
		 * commoncap, ima, and evm.
		 */
		ret = magic_inode_protect_setxattr(dentry, name, value, size);
	}
	if (ret)
		return ret;

	/* Run all of the security module policy against this setxattr call */
	return magic_inode_policy_setxattr(dentry, name, value, size);
}
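
To make the two phases concrete, here is a rough userspace model of the
same idea.  It is a sketch under assumptions, not kernel code: struct
xattr_request, the model_* functions, and the three booleans are all
hypothetical, and the choice that the first module error wins is my own
simplification, since the outline above does not say what happens if
several modules claim one attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define XATTR_SECURITY_PREFIX "security."

/* Hypothetical request; the booleans stand in for real permission checks. */
struct xattr_request {
	const char *name;
	bool module_check_ok; /* assumed result of the owning module's rule */
	bool ns_sys_admin;    /* assumed result of the default CAP_SYS_ADMIN check */
	bool policy_ok;       /* assumed result of the general policy pass */
};

typedef int (*protect_fn)(const struct xattr_request *);

/* Each module claims only its own attribute, else returns -EOPNOTSUPP. */
static int model_selinux_protect(const struct xattr_request *r)
{
	if (strcmp(r->name, "security.selinux") != 0)
		return -EOPNOTSUPP;
	return r->module_check_ok ? 0 : -EPERM;
}

static int model_commoncap_protect(const struct xattr_request *r)
{
	if (strcmp(r->name, "security.capability") != 0)
		return -EOPNOTSUPP;
	return r->module_check_ok ? 0 : -EPERM;
}

/* Phase 1: ask the owners; fall back to the default check if nobody claims. */
static int model_protect_setxattr(const struct xattr_request *r)
{
	static const protect_fn modules[] = {
		model_selinux_protect, model_commoncap_protect,
	};
	bool claimed = false;

	for (size_t i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
		int rc = modules[i](r);

		if (rc == -EOPNOTSUPP)
			continue;
		claimed = true;
		if (rc)
			return rc; /* assumption: first error wins */
	}
	if (!claimed && !r->ns_sys_admin)
		return -EPERM;
	return 0;
}

/* Phase 2: general policy, applied to every setxattr call. */
static int model_policy_setxattr(const struct xattr_request *r)
{
	return r->policy_ok ? 0 : -EACCES;
}

static int model_security_inode_setxattr(const struct xattr_request *r)
{
	int ret = 0;

	assert(r != NULL && r->name != NULL);
	if (strncmp(r->name, XATTR_SECURITY_PREFIX,
		    sizeof(XATTR_SECURITY_PREFIX) - 1) == 0)
		ret = model_protect_setxattr(r);
	if (ret)
		return ret;
	return model_policy_setxattr(r);
}
```

In this model an unclaimed security.foo is rejected without the default
capability, while security.capability is decided by its owning module
alone and then still goes through the general policy pass.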

Eric

From ebiederm at xmission.com  Sat Sep 30 20:40:43 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 30 Sep 2017 15:40:43 -0500
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
In-Reply-To: <db1c58f3-5a01-5276-eba7-5aac7cdcbcf5@schaufler-ca.com> (Casey
	Schaufler's message of "Sat, 30 Sep 2017 10:01:48 -0700")
References: <87tvzmqwoi.fsf@xmission.com>
	<1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
	<1506694737.5571.9.camel@tycho.nsa.gov>
	<6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com>
	<87vak0ma00.fsf@xmission.com>
	<db1c58f3-5a01-5276-eba7-5aac7cdcbcf5@schaufler-ca.com>
Message-ID: <87d167ncms.fsf@xmission.com>

Casey Schaufler <casey at schaufler-ca.com> writes:

> On 9/30/2017 9:22 AM, Eric W. Biederman wrote:
>> Casey Schaufler <casey at schaufler-ca.com> writes:
>>
>>> On 9/29/2017 7:18 AM, Stephen Smalley wrote:
>>>> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote:
>>>>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
>>>>>> It looks like once upon a time a long time ago selinux copied code
>>>>>> from cap_inode_removexattr and cap_inode_setxattr into
>>>>>> selinux_inode_setotherxattr.  However the code has now diverged and
>>>>>> selinux is implementing a policy that is quite different than
>>>>>> cap_inode_setxattr and cap_inode_removexattr especially when it
>>>>>> comes
>>>>>> to the security.capable xattr.
>>>>> What leads you to believe that this isn't intentional?
>>>>> It's most likely the case that this change occurred as
>>>>> part of the first round module stacking change. What behavior
>>>>> do you see that you're unhappy with?
>>>>>
>>>>>> To keep things working
>>>>> Which "things"? How are they not "working"?
>>>>>
>>>>>> and to make the comments in security/security.c
>>>>>> correct when the xattr is securit.capable, call cap_inode_setxattr
>>>>>> or cap_inode_removexattr as appropriate.
>>>>>>
>>>>>> I suspect there is a larger conversation to be had here but this
>>>>>> is enough to keep selinux from implementing a non-sense hard coded
>>>>>> policy that breaks other parts of the kernel.
>>>>> Specifics, please. Since I can't guess what problem you've
>>>>> encountered I can't tell if it's here, in the infrastructure,
>>>>> or in your perception of what constitutes "broken".
>>>>>
>>>>>> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>>>>>> ---
>>>>>>  security/selinux/hooks.c | 6 ++++++
>>>>>>  1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>>>> index f5d304736852..edf4bd292dc7 100644
>>>>>> --- a/security/selinux/hooks.c
>>>>>> +++ b/security/selinux/hooks.c
>>>>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>>>>>>  	u32 newsid, sid = current_sid();
>>>>>>  	int rc = 0;
>>>>>>  
>>>>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>>>>> +		return cap_inode_setxattr(dentry, name, value, size, flags);
>>>>>> +
>>>>> No. Don't even think of contemplating considering embedding the cap
>>>>> attribute check in the SELinux code. cap_inode_setxattr() is called
>>>>> in
>>>>> the infrastructure.
>>>> Except that it isn't, not if any other security module is enabled and
>>>> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when
>>>> setting security.selinux or security.SMACK*.
>>> OK. Yes, this bit of the infrastructure is some of the
>>> worst I've done in a long time. This is a case where we
>>> already need special case stacking infrastructure. It looks
>>> like we'll have to separate setting the cap attribute from
>>> checking the cap state in order to make this work. In any
>>> case, the security_inode_setxattr() code is where the change
>>> belongs. There will likely be fallout changes in the modules,
>>> including the cap module.
>>>
>>>> An alternative approach to fixing this would be to change the cap
>>>> functions to only apply their checks if setting the capability
>>>> attribute and defer any checks on other security.* attributes to either
>>>> the security framework or the other security modules.  Then the
>>>> framework could always call all the modules on the inode_setxattr and
>>>> inode_removexattr hooks as with other hooks.  The security framework
>>>> would then need to ensure that a check is still applied when setting
>>>> security.* attributes if it isn't already handled by one of the enabled
>>>> security modules, as you don't want unprivileged userspace to be able
>>>> to set arbitrary security.foo attributes or to set up security.selinux
>>>> or security.SMACK* attributes if those modules happen to be disabled.
>>> Agreed. This isn't a two line change. Grumble.
>>>
>>> I can guess at what the problem might be, but I hate making
>>> assumptions when I go to fix a problem. I will start looking
>>> at a patch, but it would really help if I could say for sure
>>> what I'm out to accomplish. It may be obvious to the casual
>>> observer, but that description has not been applied to me very
>>> often.
>> Apologies for the delayed reply.
>>
>> I am looking at security_inode_setxattr.
>>
>> For setting attributes in the security.* the generic code in fs/xattr.c
>> applies no permission checks.
>>
>> Each security module that implements an xattr in security.* then imposes
>> it's own policy on it's own attribute.
>>
>> For smack the basic rule is smack_privileged(CAP_MAC_ADMIN).
>> For selinux the basic rule is inode_or_owner_capable(inode).
>> For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP).
>>
>> commoncap also applies a default policity to setting security.* xattrs.
>> ns_capable(dentry->d_sb->s_userns, CAP_SYS_ADMIN).
>>
>> smack reuses that default policy by calling cap_inode_setxattr if it
>> isn't a smack security.* xattr.
>>
>> selinux has what looks like an old copy of the commoncap checks for
>> the security.* in selinux_inode_setotherxattr.  Testing for
>> capable(CAP_SETFCAP) for security.capable and capable(CAP_SYS_ADMIN)
>> for the others.
>>
>> With the added complication that selinux calls
>> selinux_inode_setotherxattr also for the remove_xattr case.  So fixing
>> this in selinux_inode_setotherxattr is not appropriate.
>>
>> I believe selinux also has general policy hooks it applies to all
>> invocations of setxattr.
>>
>> So I think to really fix this we need to separate the cases of is this
>> your security modules attribute from general policy checks added by the
>> security modules.  Perhaps something like this for
>> security_inode_setxattr:
>>
>> Hmm.  Looking at least ima also has the distinction between protecting
>> it's own xattr writes and running generaly security module policy on
>> xattr writes.
>>
>> int security_inode_setxattr(struct dentry *dentry, const char *name,
>> 			    const void *value, size_t size, int flags)
>> {
>> 	int ret = 0;
>>
>> 	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
>> 		return 0;
>>
>> 	if (strncmp(name, XATTR_SECURITY_PREFIX,
>> 			sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) {
>> 		/* Call the security modules and see if they all return
>>                  * -EOPNOTSUPP if so apply the default permission
>>                  * check of ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN)
>>                  * otherwise if one of the security modules supports
>> 		 * this attribute (signaled by returning something other
>> 		 * -EOPNOTSUPP) then set ret to that result.
>>                  *
>>                  * The security modules include at least smack, selinux,
>> 		 * commoncap, ima, and evm.
>>                  */
>>                  ret = magic_inode_protect_setxattr(dentry, name, value, size);
>>         }
>> 	if (ret)
>> 		return ret;
>>
>>         /* Run all of the security module policy against this setxattr call */
>>         return magic_inode_policy_setxattr(dentry, name, value, size);
>> }
>>
>> Eric
>
> Yup, that's pretty much what I'm thinking. It's unfortunate
> that the magic_ API isn't fully implemented. There's going to
> be a good deal of code surgery instead. Is there an observed
> problem today? This is going to have to get addressed for stacking,
> so if there isn't a behavioral issue that impacts something real
> I would like to defer spending significant time on it. Do you have
> a case where this is not working correctly?

Merged as of 4.14-rc1 is the support for user namespace root to set
security.capability.  This fails when selinux is loaded.

removexattr has the same problem and the code is a little less
convoluted in that case.

Not being able to set the capability when you should be able to is
very noticeable.  Like running into a brick wall noticeable.

Which is where the minimal patch for selinux comes in.  I think it
solves the exact case in question, even if it isn't the perfect long
term solution.
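
A toy model of why the two styles of check disagree for user namespace
root, assuming that capable() tests against the initial user namespace
while an ns_capable()-style check tests against a chosen namespace.
Everything named model_* here is hypothetical, and the real kernel
rules (namespace ownership, uid/gid mapping of the inode) are
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace: only the parent link is modelled. */
struct model_user_ns {
	const struct model_user_ns *parent;
};

/* Hypothetical task: the capability is held relative to its own namespace. */
struct model_task {
	const struct model_user_ns *ns;
	bool cap_setfcap;
};

static const struct model_user_ns model_init_user_ns = { NULL };

/* The capability applies to the task's own namespace and its descendants. */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_user_ns *target)
{
	assert(t != NULL && target != NULL);
	for (const struct model_user_ns *ns = target; ns; ns = ns->parent)
		if (ns == t->ns)
			return t->cap_setfcap;
	return false;
}

/* capable()-style check: always relative to the initial namespace. */
static bool model_capable(const struct model_task *t)
{
	return model_ns_capable(t, &model_init_user_ns);
}
```

Root inside a child namespace passes the namespace-relative check for
that namespace but fails the capable()-style one, which matches the
failure I am seeing.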

Eric


From casey at schaufler-ca.com  Sat Sep 30 23:22:12 2017
From: casey at schaufler-ca.com (Casey Schaufler)
Date: Sat, 30 Sep 2017 16:22:12 -0700
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr
	hooks behave
In-Reply-To: <87d167ncms.fsf@xmission.com>
References: <87tvzmqwoi.fsf@xmission.com>
	<1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com>
	<1506694737.5571.9.camel@tycho.nsa.gov>
	<6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com>
	<87vak0ma00.fsf@xmission.com>
	<db1c58f3-5a01-5276-eba7-5aac7cdcbcf5@schaufler-ca.com>
	<87d167ncms.fsf@xmission.com>
Message-ID: <bf18e641-91ed-0d75-f514-c059b5dfbb14@schaufler-ca.com>

On 9/30/2017 1:40 PM, Eric W. Biederman wrote:
> Casey Schaufler <casey at schaufler-ca.com> writes:
>
>> On 9/30/2017 9:22 AM, Eric W. Biederman wrote:
>>> Casey Schaufler <casey at schaufler-ca.com> writes:
>>>
>>>> On 9/29/2017 7:18 AM, Stephen Smalley wrote:
>>>>> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote:
>>>>>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote:
>>>>>>> It looks like once upon a time a long time ago selinux copied code
>>>>>>> from cap_inode_removexattr and cap_inode_setxattr into
>>>>>>> selinux_inode_setotherxattr.  However the code has now diverged and
>>>>>>> selinux is implementing a policy that is quite different than
>>>>>>> cap_inode_setxattr and cap_inode_removexattr especially when it
>>>>>>> comes
>>>>>>> to the security.capable xattr.
>>>>>> What leads you to believe that this isn't intentional?
>>>>>> It's most likely the case that this change occurred as
>>>>>> part of the first round module stacking change. What behavior
>>>>>> do you see that you're unhappy with?
>>>>>>
>>>>>>> To keep things working
>>>>>> Which "things"? How are they not "working"?
>>>>>>
>>>>>>> and to make the comments in security/security.c
>>>>>>> correct when the xattr is securit.capable, call cap_inode_setxattr
>>>>>>> or cap_inode_removexattr as appropriate.
>>>>>>>
>>>>>>> I suspect there is a larger conversation to be had here but this
>>>>>>> is enough to keep selinux from implementing a non-sense hard coded
>>>>>>> policy that breaks other parts of the kernel.
>>>>>> Specifics, please. Since I can't guess what problem you've
>>>>>> encountered I can't tell if it's here, in the infrastructure,
>>>>>> or in your perception of what constitutes "broken".
>>>>>>
>>>>>>> Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>>>>>>> ---
>>>>>>>  security/selinux/hooks.c | 6 ++++++
>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>
>>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>>>>> index f5d304736852..edf4bd292dc7 100644
>>>>>>> --- a/security/selinux/hooks.c
>>>>>>> +++ b/security/selinux/hooks.c
>>>>>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
>>>>>>>  	u32 newsid, sid = current_sid();
>>>>>>>  	int rc = 0;
>>>>>>>  
>>>>>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>>>>>> +		return cap_inode_setxattr(dentry, name, value, size, flags);
>>>>>>> +
>>>>>> No. Don't even think of contemplating considering embedding the cap
>>>>>> attribute check in the SELinux code. cap_inode_setxattr() is called
>>>>>> in
>>>>>> the infrastructure.
>>>>> Except that it isn't, not if any other security module is enabled and
>>>>> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when
>>>>> setting security.selinux or security.SMACK*.
>>>> OK. Yes, this bit of the infrastructure is some of the
>>>> worst I've done in a long time. This is a case where we
>>>> already need special case stacking infrastructure. It looks
>>>> like we'll have to separate setting the cap attribute from
>>>> checking the cap state in order to make this work. In any
>>>> case, the security_inode_setxattr() code is where the change
>>>> belongs. There will likely be fallout changes in the modules,
>>>> including the cap module.
>>>>
>>>>> An alternative approach to fixing this would be to change the cap
>>>>> functions to only apply their checks if setting the capability
>>>>> attribute and defer any checks on other security.* attributes to either
>>>>> the security framework or the other security modules.  Then the
>>>>> framework could always call all the modules on the inode_setxattr and
>>>>> inode_removexattr hooks as with other hooks.  The security framework
>>>>> would then need to ensure that a check is still applied when setting
>>>>> security.* attributes if it isn't already handled by one of the enabled
>>>>> security modules, as you don't want unprivileged userspace to be able
>>>>> to set arbitrary security.foo attributes or to set up security.selinux
>>>>> or security.SMACK* attributes if those modules happen to be disabled.
>>>> Agreed. This isn't a two line change. Grumble.
>>>>
>>>> I can guess at what the problem might be, but I hate making
>>>> assumptions when I go to fix a problem. I will start looking
>>>> at a patch, but it would really help if I could say for sure
>>>> what I'm out to accomplish. It may be obvious to the casual
>>>> observer, but that description has not been applied to me very
>>>> often.
>>> Apologies for the delayed reply.
>>>
>>> I am looking at security_inode_setxattr.
>>>
>>> For setting attributes in the security.* the generic code in fs/xattr.c
>>> applies no permission checks.
>>>
>>> Each security module that implements an xattr in security.* then imposes
>>> it's own policy on it's own attribute.
>>>
>>> For smack the basic rule is smack_privileged(CAP_MAC_ADMIN).
>>> For selinux the basic rule is inode_or_owner_capable(inode).
>>> For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP).
>>>
>>> commoncap also applies a default policity to setting security.* xattrs.
>>> ns_capable(dentry->d_sb->s_userns, CAP_SYS_ADMIN).
>>>
>>> smack reuses that default policy by calling cap_inode_setxattr if it
>>> isn't a smack security.* xattr.
>>>
>>> selinux has what looks like an old copy of the commoncap checks for
>>> security.* in selinux_inode_setotherxattr, testing for
>>> capable(CAP_SETFCAP) for security.capability and capable(CAP_SYS_ADMIN)
>>> for the others.
>>>
>>> With the added complication that selinux calls
>>> selinux_inode_setotherxattr also for the remove_xattr case.  So fixing
>>> this in selinux_inode_setotherxattr is not appropriate.
>>>
>>> I believe selinux also has general policy hooks it applies to all
>>> invocations of setxattr.
>>>
>>> So I think to really fix this we need to separate the "is this your
>>> security module's attribute" checks from the general policy checks added
>>> by the security modules.  Perhaps something like this for
>>> security_inode_setxattr:
>>>
>>> Hmm.  Looking further, at least ima also has the distinction between
>>> protecting its own xattr writes and running general security module
>>> policy on xattr writes.
>>>
>>> int security_inode_setxattr(struct dentry *dentry, const char *name,
>>> 			    const void *value, size_t size, int flags)
>>> {
>>> 	int ret = 0;
>>>
>>> 	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
>>> 		return 0;
>>>
>>> 	if (strncmp(name, XATTR_SECURITY_PREFIX,
>>> 		    sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) {
>>> 		/* Call the security modules and see if they all return
>>> 		 * -EOPNOTSUPP; if so, apply the default permission
>>> 		 * check of ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
>>> 		 * Otherwise, if one of the security modules supports
>>> 		 * this attribute (signaled by returning something other
>>> 		 * than -EOPNOTSUPP), set ret to that result.
>>> 		 *
>>> 		 * The security modules include at least smack, selinux,
>>> 		 * commoncap, ima, and evm.
>>> 		 */
>>> 		ret = magic_inode_protect_setxattr(dentry, name, value, size);
>>> 	}
>>> 	if (ret)
>>> 		return ret;
>>>
>>> 	/* Run all of the security module policy against this setxattr call */
>>> 	return magic_inode_policy_setxattr(dentry, name, value, size);
>>> }
>>>
>>> Eric
>> Yup, that's pretty much what I'm thinking. It's unfortunate
>> that the magic_ API isn't fully implemented. There's going to
>> be a good deal of code surgery instead. Is there an observed
>> problem today? This is going to have to get addressed for stacking,
>> so if there isn't a behavioral issue that impacts something real
>> I would like to defer spending significant time on it. Do you have
>> a case where this is not working correctly?
> Merged as of 4.14-rc1 is the support for user namespace root to set
> security.capability.  This fails when selinux is loaded.

OK. Is the failure unique to SELinux, or does it fail with
Smack as well?

> removexattr has the same problem and the code is a little less
> convoluted in that case.

Right. Because removexattr is a simpler situation.

> Not being able to set the capability when you should be able to is
> very noticeable.  Like running into a brick wall noticeable.

Ah, now you've identified the problem. Yes, I would agree that you've
uncovered an undesirable behavior.

> Which is where the minimal patch for selinux comes in.  I think it
> solves the exact case in question, even if it isn't the perfect long
> term solution.

If the problem is unique to SELinux I can see your logic. If it
isn't, that is, if it's also a problem with any other security
module, there either needs to be a fix for that/those module/s
as well or a "real" fix.

I'm not opposed to the SELinux short term fix if you can say
that that's the only module with the problem.
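
For reference, the two-phase split quoted above can be modelled in
user space. The following is only a sketch under stated assumptions,
not kernel code and not a proposed patch: every model_* name and
struct model_caller are hypothetical, capabilities are reduced to two
booleans, and the SELinux ownership check is left out. It shows the
intended behaviour only: a module that owns the attribute decides,
and the CAP_SYS_ADMIN default applies when every module returns
-EOPNOTSUPP.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define XATTR_SECURITY_PREFIX "security."

/* Hypothetical stand-in for the caller's capabilities. */
struct model_caller {
	bool cap_setfcap;
	bool cap_sys_admin;
};

/* A "protect" hook returns -EOPNOTSUPP when the attribute is not its own. */
typedef int (*model_protect_fn)(const char *name,
				const struct model_caller *c);

static int model_cap_protect(const char *name, const struct model_caller *c)
{
	if (strcmp(name, "security.capability") != 0)
		return -EOPNOTSUPP;
	return c->cap_setfcap ? 0 : -EPERM;
}

static int model_selinux_protect(const char *name,
				 const struct model_caller *c)
{
	(void)c;
	if (strcmp(name, "security.selinux") != 0)
		return -EOPNOTSUPP;
	return 0;	/* simplification: the owner check is omitted */
}

static const model_protect_fn model_modules[] = {
	model_cap_protect,
	model_selinux_protect,
};

/* Phase 1: the first module that claims the attribute decides;
 * if none claims it, fall back to the CAP_SYS_ADMIN default. */
static int model_protect_setxattr(const char *name,
				  const struct model_caller *c)
{
	size_t i;

	for (i = 0; i < sizeof(model_modules) / sizeof(model_modules[0]); i++) {
		int ret = model_modules[i](name, c);

		if (ret != -EOPNOTSUPP)
			return ret;
	}
	return c->cap_sys_admin ? 0 : -EPERM;
}

/* Phase 2: general module policy; modelled here as always allowing. */
static int model_policy_setxattr(const char *name,
				 const struct model_caller *c)
{
	(void)name;
	(void)c;
	return 0;
}

int model_security_inode_setxattr(const char *name,
				  const struct model_caller *c)
{
	int ret = 0;

	if (strncmp(name, XATTR_SECURITY_PREFIX,
		    sizeof(XATTR_SECURITY_PREFIX) - 1) == 0)
		ret = model_protect_setxattr(name, c);
	if (ret)
		return ret;

	return model_policy_setxattr(name, c);
}
```

In this model a caller with only cap_setfcap set can write
security.capability but gets -EPERM for an unclaimed security.foo,
while attributes outside security.* skip phase 1 entirely.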

>
> Eric

