From stgraber at ubuntu.com Mon Sep 4 22:28:57 2017
From: stgraber at ubuntu.com (Stéphane Graber)
Date: Mon, 4 Sep 2017 18:28:57 -0400
Subject: Linux Plumbers containers micro-conference CFP
In-Reply-To: <20170727182929.t5k665eceewup2xs@castiana>
References: <20170705193033.puyhniz7rvoo572f@castiana> <20170727182929.t5k665eceewup2xs@castiana>
Message-ID: <20170904222857.uzafhjdqxxli2e5k@castiana>

On Thu, Jul 27, 2017 at 02:29:29PM -0400, Stéphane Graber wrote:
> On Wed, Jul 05, 2017 at 03:30:34PM -0400, Stéphane Graber wrote:
> > Hey there,
> >
> > Linux Plumbers 2017 will be held in Los Angeles, CA between the 13th and
> > 15th of September 2017 including the usual containers micro-conference.
> >
> > This is a great place to catch up with fellow maintainers and users and
> > to discuss issues that affect us all.
> >
> > You can find the more detailed CFP here:
> > https://discuss.linuxcontainers.org/t/containers-micro-conference-at-linux-plumbers-2017/262
> >
> > CFP closes on the 4th of August 2017.
> >
> > Looking forward to seeing you there!
>
> This is a reminder that we're still looking for more submissions for the
> containers micro-conference at Linux Plumbers this fall in Los Angeles.
>
> We're looking for short talks/demos as well as discussion topics for our
> audience of kernel developers, container runtime maintainers and
> container users!
>
>
> Proposals can be submitted here: https://linuxplumbersconf.org/2017/ocw/events/LPC2017/proposals/new
>
> See you in Los Angeles!
>
> Stéphane
>
> PS: Forwarding to container projects mailing-lists would be appreciated!

Hey there,

We have now published the schedule for next week's micro-conference:
https://discuss.linuxcontainers.org/t/containers-micro-conference-schedule/490

See you in Los Angeles!

Stéphane
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 801 bytes
Desc: not available
URL:

From stgraber at ubuntu.com Wed Sep 6 05:37:38 2017
From: stgraber at ubuntu.com (Stéphane Graber)
Date: Wed, 6 Sep 2017 01:37:38 -0400
Subject: LXC 2.1 has been released
Message-ID: <20170906053738.j3gsfcmlzbipkgbv@castiana>

Hey there,

After 1.5 years of development, we've finally tagged a new feature
release of LXC.

LXC 2.1 is a normal feature release coming with a year of upstream
support. For production environments, you should stick to LXC 2.0 which
benefits from much longer support.

This new release of LXC introduces a few new security features and
various improvements to the LXC tools and templates.

But more importantly, it's a transitional release ahead of LXC 3.0 to be
released early next year. LXC 3.0 will deprecate a number of tools and
change a large number of the existing configuration keys.

LXC 2.1 will issue warnings whenever the user is using something which
will be removed or renamed in the upcoming LXC 3.0. An lxc-update-config
tool is also provided to automatically convert your containers'
configurations to the new format.

More details about LXC 2.1 can be found in the release announcement:
https://discuss.linuxcontainers.org/t/lxc-2-1-has-been-released/487

--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: not available URL: From serge at hallyn.com Wed Sep 6 14:03:42 2017 From: serge at hallyn.com (Serge E. Hallyn) Date: Wed, 6 Sep 2017 09:03:42 -0500 Subject: [PATCH 2/9] Implement containers as kernel objects In-Reply-To: <20170818080300.GQ7187@madcap2.tricolour.ca> References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk> <149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk> <20170814054711.GB29957@madcap2.tricolour.ca> <20170818080300.GQ7187@madcap2.tricolour.ca> Message-ID: <20170906140341.GA8729@mail.hallyn.com> Quoting Richard Guy Briggs (rgb at redhat.com): ... > > I believe we are going to need a container ID to container definition > > (namespace, etc.) mapping mechanism regardless of if the container ID > > is provided by userspace or a kernel generated serial number. This > > mapping should be recorded in the audit log when the container ID is > > created/defined. > > Agreed. > > > > As was suggested in one of the previous threads, if there are any events not > > > associated with a task (incoming network packets) we log the namespace ID and > > > then only concern ourselves with its container serial number or container name > > > once it becomes associated with a task at which point that tracking will be > > > more important anyways. > > > > Agreed. After all, a single namespace can be shared between multiple > > containers. For those security officers who need to track individual > > events like this they will have the container ID mapping information > > in the logs as well so they should be able to trace the unassociated > > event to a set of containers. 
> > > > > I'm not convinced that a userspace or kernel generated UUID is that useful > > > since they are large, not human readable and may not be globally unique given > > > the "pets vs cattle" direction we are going with potentially identical > > > conditions in hosts or containers spawning containers, but I see no need to > > > restrict them. > > > > From a kernel perspective I think an int should suffice; after all, > > you can't have more containers then you have processes. If the > > container engine requires something more complex, it can use the int > > as input to its own mapping function. > > PIDs roll over. That already causes some ambiguity in reporting. If a > system is constantly spawning and reaping containers, especially > single-process containers, I don't want to have to worry about that ID > rolling to keep track of it even though there should be audit records of > the spawn and death of each container. There isn't significant cost > added here compared with some of the other overhead we're dealing with. Strawman proposal: 1. Each clone/unshare/setns involving a namespace type generates an audit message along the lines of: PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET new auditnsid: 00000002 associated namespaces: (list of all namespace filesystem inode numbers) 2. Userspace (i.e. the container logging deamon here) can watch the audit log for all messages relating to auditnsid 00000002. Presumably there will be messages along the lines of "PID 9513 in auditnsid 00000002 cloned...". The container logging daemon can track those messages and add the new auditnsids to the list it watches. 3. If a container is migrated (checkpointed and restored here or elsewhere), userspace can just follow the appropriate logs for the new containers. Userspace does not ever *request* a auditnsid. They are ephemeral, just a tool to track the namespaces through the audit log. 
They are however guaranteed to never be re-used until reboot. (Feels like someone must have proposed this before) -serge From paul at paul-moore.com Fri Sep 8 20:02:25 2017 From: paul at paul-moore.com (Paul Moore) Date: Fri, 8 Sep 2017 16:02:25 -0400 Subject: [PATCH 2/9] Implement containers as kernel objects In-Reply-To: <20170818080300.GQ7187@madcap2.tricolour.ca> References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk> <149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk> <20170814054711.GB29957@madcap2.tricolour.ca> <20170818080300.GQ7187@madcap2.tricolour.ca> Message-ID: On Fri, Aug 18, 2017 at 4:03 AM, Richard Guy Briggs wrote: > On 2017-08-16 18:21, Paul Moore wrote: >> On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs wrote: >> > Hi David, >> > >> > I wanted to respond to this thread to attempt some constructive feedback, >> > better late than never. I had a look at your fsopen/fsmount() patchset(s) to >> > support this patchset which was interesting, but doesn't directly affect my >> > work. The primary patch of interest to the audit kernel folks (Paul Moore and >> > me) is this patch while the rest of the patchset is interesting, but not likely >> > to directly affect us. This patch has most of what we need to solve our >> > problem. >> > >> > Paul and I agree that audit is going to have a difficult time identifying >> > containers or even namespaces without some change to the kernel. The audit >> > subsystem in the kernel needs at least a basic clue about which container >> > caused an event to be able to report this at the appropriate level and ignore >> > it at other levels to avoid a DoS. >> >> While there is some increased risk of "death by audit", this is really >> only an issue once we start supporting multiple audit daemons; simply >> associating auditable events with the container that triggered them >> shouldn't add any additional overhead (I hope). 
For a number of use >> cases, a single auditd running outside the containers, but recording >> all their events with some type of container attribution will be >> sufficient. This is step #1. >> >> However, we will obviously want to go a bit further and support >> multiple audit daemons on the system to allow containers to >> record/process their own events (side note: the non-container auditd >> instance will still see all the events). There are a number of ways >> we could tackle this, both via in-kernel and in-userspace record >> routing, each with their own pros/cons. However, how this works is >> going to be dependent on how we identify containers and track their >> audit events: the bits from step #1. For this reason I'm not really >> interested in worrying about the multiple auditd problem just yet; >> it's obviously important, and something to keep in mind while working >> up a solution, but it isn't something we should focus on right now. >> >> > We also agree that there will need to be some sort of trigger from userspace to >> > indicate the creation of a container and its allocated resources and we're not >> > really picky how that is done, such as a clone flag, a syscall or a sysfs write >> > (or even a read, I suppose), but there will need to be some permission >> > restrictions, obviously. (I'd like to see capabilities used for this by adding >> > a specific container bit to the capabilities bitmask.) >> >> To be clear, from an audit perspective I think the only thing we would >> really care about controlling access to is the creation and assignment >> of a new audit container ID/token, not necessarily the container >> itself. It's a small point, but an important one I think. >> >> > I doubt we will be able to accomodate all definitions or concepts of a >> > container in a timely fashion. 
We'll need to start somewhere with a minimum >> > definition so that we can get traction and actually move forward before another >> > compelling shared kernel microservice method leaves our entire community >> > behind. I'd like to declare that a container is a full set of cloned >> > namespaces, but this is inefficient, overly constricting and unnecessary for >> > our needs. If we could agree on a minimum definition of a container (which may >> > have only one specific cloned namespace) then we have something on which to >> > build. I could even see a container being defined by a trigger sent from >> > userspace about a process (task) from which all its children are considered to >> > be within that container, subject to further nesting. >> >> I really would prefer if we could avoid defining the term "container". >> Even if we manage to get it right at this particular moment, we will >> surely be made fools a year or two from now when things change. At >> the very least lets avoid a rigid definition of container, I'll >> concede that we will probably need to have some definition simply so >> we can implement something, I just don't want the design or >> implementation to depend on a particular definition. >> >> This comment is jumping ahead a bit, but from an audit perspective I >> think we handle this by emitting an audit record whenever a container >> ID is created which describes it as the kernel sees it; as of now that >> probably means a list of namespace IDs. Richard mentions this in his >> email, I just wanted to make it clear that I think we should see this >> as a flexible mechanism. At the very least we will likely see a few >> more namespaces before the world moves on from containers. >> >> > In the simplest usable model for audit, if a container (definition implies and) >> > starts a PID namespace, then the container ID could simply be the container's >> > "init" process PID in the initial PID namespace. 
This assumes that as soon as >> > that process vanishes, that entire container and all its children are killed >> > off (which you've done). There may be some container orchestration systems >> > that don't use a unique PID namespace per container and that imposing this will >> > cause them challenges. >> >> I don't follow how this would cause challenges if the containers do >> not use a unique PID namespace; you are suggesting using the PID from >> in the context of the initial PID namespace, yes? > > The PID of the "init" process of a container (PID=1 inside container, > but PID=containerID from the initial PID namespace perspective). Yep. I still don't see how a container not creating a unique PID namespace presents a challenge here as the unique information would be taken from the initial PID namespace. However, based on some off-list discussions I expect this is going to be a non-issue in the next proposal. >> Regardless, I do worry that using a PID could potentially be a bit >> racy once we start jumping between kernel and userspace (audit >> configuration, logs, etc.). > > How do you think this could be racy? An event happenning before or as > the container has been defined? It's racy for the same reasons why we have the pid struct in the kernel. If the orchestrator is referencing things via a PID there is always some danger of a mixup. >> > If containers have at minimum a unique mount namespace then the root path >> > dentry inode device and inode number could be used, but there are likely better >> > identifiers. Again, there may be container orchestrators that don't use a >> > unique mount namespace per container and that imposing this will cause >> > challenges. >> > >> > I expect there are similar examples for each of the other namespaces. >> >> The PID case is a bit unique as each process is going to have a unique >> PID regardless of namespaces, but even that has some drawbacks as >> discussed above. 
As for the other namespaces, I agree that we can't >> rely on them (see my earlier comments). > > (In general can you specify which earlier comments so we can be sure to > what you are referring?) Really? How about the race condition concerns. Come on Richard ... >> > If we could pick one namespace type for consensus for which each container has >> > a unique instance of that namespace, we could use the dev/ino tuple from that >> > namespace as had originally been suggested by Aristeu Rozanski more than 4 >> > years ago as part of the set of namespace IDs. I had also attempted to >> > solve this problem by using the namespace' proc inode, then switched over to >> > generate a unique kernel serial number for each namespace and then went back to >> > namespace proc dev/ino once Al Viro implemented nsfs: >> > v1 https://lkml.org/lkml/2014/4/22/662 >> > v2 https://lkml.org/lkml/2014/5/9/637 >> > v3 https://lkml.org/lkml/2014/5/20/287 >> > v4 https://lkml.org/lkml/2014/8/20/844 >> > v5 https://lkml.org/lkml/2014/10/6/25 >> > v6 https://lkml.org/lkml/2015/4/17/48 >> > v7 https://lkml.org/lkml/2015/5/12/773 >> > >> > These patches don't use a container ID, but track all namespaces in use for an >> > event. This has the benefit of punting this tracking to userspace for some >> > other tool to analyse and determine to which container an event belongs. >> > This will use a lot of bandwidth in audit log files when a single >> > container ID that doesn't require nesting information to be complete >> > would be a much more efficient use of audit log bandwidth. >> >> Relying on a particular namespace to identify a containers is a >> non-starter from my perspective for all the reasons previously >> discussed. > > I'd rather not either and suspect there isn't much danger of it, but if > it is determined that there is one namespace in particular that is a > minimum requirement, I'd prefer to use that nsID instead of creating an > additional ID. 
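For reference, the device/inode tuple discussed above is already visible
from userspace through nsfs: stat() on an entry under /proc/<pid>/ns
returns the pair that identifies that namespace. A minimal Python sketch,
assuming a Linux host with /proc mounted; the helper name namespace_ids
is illustrative and not something proposed in this thread:

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: (st_dev, st_ino)} for one process.

    Reads the nsfs-backed entries under /proc/<pid>/ns. Returns an empty
    dict when that directory is unavailable (non-Linux host, no /proc,
    or no such process).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return ids
    for name in names:
        try:
            # stat() follows the symlink to the namespace object itself.
            st = os.stat(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not accessible to this user
        ids[name] = (st.st_dev, st.st_ino)
    return ids


if __name__ == "__main__":
    for name, (dev, ino) in namespace_ids().items():
        print(f"{name}: dev={dev} ino={ino}")
```

Two processes that share a namespace report the same pair for that
entry, which is why the tuple works as a namespace ID but does not by
itself say which container an event belongs to.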
> >> > If we rely only on the setting of arbitrary container names from userspace, >> > then we must provide a map or tree back to the initial audit domain for that >> > running kernel to be able to differentiate between potentially identical >> > container names assigned in a nested container system. If we assign a >> > container serial number sequentially (atomic64_inc) from the kernel on request >> > from userspace like the sessionID and log the creation with all nsIDs and the >> > parent container serial number and/or container name, the nesting is clear due >> > to lack of ambiguity in potential duplicate names in nesting. If a container >> > serial number is used, the tree of inheritance of nested containers can be >> > rebuilt from the audit records showing what containers were spawned from what >> > parent. >> >> I believe we are going to need a container ID to container definition >> (namespace, etc.) mapping mechanism regardless of if the container ID >> is provided by userspace or a kernel generated serial number. This >> mapping should be recorded in the audit log when the container ID is >> created/defined. > > Agreed. > >> > As was suggested in one of the previous threads, if there are any events not >> > associated with a task (incoming network packets) we log the namespace ID and >> > then only concern ourselves with its container serial number or container name >> > once it becomes associated with a task at which point that tracking will be >> > more important anyways. >> >> Agreed. After all, a single namespace can be shared between multiple >> containers. For those security officers who need to track individual >> events like this they will have the container ID mapping information >> in the logs as well so they should be able to trace the unassociated >> event to a set of containers. 
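The serial-number scheme quoted above can be modelled in a few lines.
This is a userspace toy in Python, not kernel code: itertools.count
stands in for atomic64_inc(), and the record fields (serial, parent, ns)
are an assumed layout chosen only for illustration.

```python
import itertools
from collections import defaultdict


class ContainerRegistry:
    """Toy model of kernel-assigned container serial numbers."""

    def __init__(self):
        # Monotonic counter standing in for atomic64_inc(); values are
        # never handed out twice during the lifetime of the registry.
        self._next = itertools.count(1)
        self.records = []  # stand-in for the audit log

    def create(self, ns_ids, parent=None):
        """Allocate a serial and log it with its parent and namespace IDs."""
        serial = next(self._next)
        self.records.append(
            {"serial": serial, "parent": parent, "ns": tuple(ns_ids)}
        )
        return serial


def rebuild_tree(records):
    """Rebuild the parent -> children map from creation records alone."""
    children = defaultdict(list)
    for rec in records:
        children[rec["parent"]].append(rec["serial"])
    return dict(children)
```

Because every creation record names its parent serial, the nesting can
be recovered from the log without relying on container names, which may
repeat at different nesting levels.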
>> >> > I'm not convinced that a userspace or kernel generated UUID is that useful >> > since they are large, not human readable and may not be globally unique given >> > the "pets vs cattle" direction we are going with potentially identical >> > conditions in hosts or containers spawning containers, but I see no need to >> > restrict them. >> >> From a kernel perspective I think an int should suffice; after all, >> you can't have more containers then you have processes. If the >> container engine requires something more complex, it can use the int >> as input to its own mapping function. > > PIDs roll over. That already causes some ambiguity in reporting. If a > system is constantly spawning and reaping containers, especially > single-process containers, I don't want to have to worry about that ID > rolling to keep track of it even though there should be audit records of > the spawn and death of each container. There isn't significant cost > added here compared with some of the other overhead we're dealing with. Fine, make it a u64. I believe that's what I've been proposing in the off-list discussion if memory serves. A UUID or string are not acceptable from my perspective. Too big for the audit records and not really necessary anyway, a u64 should be just fine. ... and if anyone dares bring up that 640kb quote I swear I'll NACK all their patches for the next year :) >> > How do we deal with setns()? Once it is determined that action is permitted, >> > given the new combinaiton of namespaces and potential membership in a different >> > container, record the transition from one container to another including all >> > namespaces if the latter are a different subset than the target container >> > initial set. >> >> That is a fun one, isn't it? I think this is where the container >> ID-to-definition mapping comes into play. 
If setns() changes the
>> process such that the existing container ID is no longer valid then we
>> need to do a new lookup in the table to see if another container ID is
>> valid; if no established container ID mappings are valid, the
>> container ID becomes "undefined".
>
> Hopefully we can design this stuff so that container IDs are still valid
> while that transition occurs.
>
>> paul moore
>
> - RGB
>
> --
> Richard Guy Briggs
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

--
paul moore
www.paul-moore.com

From ebiederm at xmission.com Mon Sep 11 17:21:54 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Mon, 11 Sep 2017 12:21:54 -0500
Subject: [GIT PULL] namespace updates for 4.14-rc1
Message-ID: <87mv61cfrh.fsf@xmission.com>

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

   HEAD: 076a9bcacfc7ccbc2b3fdf3bd490718f6b182419 signal/mips: Remove FPE_FIXME usage from mips

Life has been busy and I have not gotten half as much done this round as
I would have liked. I delayed it so that a minor conflict resolution
with the mips tree could spend a little time in linux-next before I sent
this pull request.

This pull request includes two long-delayed user namespace changes from
Kirill Tkhai. It also includes a very useful change from Serge Hallyn
that allows the security capability attribute to be used inside of user
namespaces. The practical effect of this is people can now untar
tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information in the
xattr name. Upon close inspection that makes the things that should be
hard easy and the things that should be easy more expensive.

Then there is my bugfix/cleanup for signal injection that removes the
magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported that the case where I had used
FPE_FIXME is impossible, so I have removed FPE_FIXME from mips, while at
the same time including a return statement in that case to keep gcc from
complaining about uninitialized variables.

I almost finished the work to make copy_siginfo_to_user a trivial copy
to user. The code is available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

But I did not have time/energy to get the code posted and reviewed
before the merge window opened.

I was able to see that the security excuse for just copying fields that
we know are initialized doesn't work in practice: there are buggy
initializations that don't initialize the proper fields in siginfo. So
we still sometimes copy uninitialized data to userspace.

Eric W. Biederman (11):
      signal/alpha: Document a conflict with SI_USER for SIGTRAP
      signal/ia64: Document a conflict with SI_USER with SIGFPE
      signal/sparc: Document a conflict with SI_USER with SIGFPE
      signal/mips: Document a conflict with SI_USER with SIGFPE
      signal/testing: Don't look for __SI_FAULT in userspace
      userns,pidns: Verify the userns for new pid namespaces
      fcntl: Don't use ambiguous SIG_POLL si_codes
      signal: Remove kernel interal si_code magic
      signal: Fix sending signals with siginfo
      mips/signal: In force_fcr31_sig return in the impossible case
      signal/mips: Remove FPE_FIXME usage from mips

Kirill Tkhai (2):
      security: Use user_namespace::level to avoid redundant iterations in cap_capable()
      prctl: Allow local CAP_SYS_ADMIN changing exe_file

Serge E. Hallyn (1):
      Introduce v3 namespaced file capabilities

 arch/alpha/include/uapi/asm/siginfo.h         |  14 ++
 arch/alpha/kernel/traps.c                     |   6 +-
 arch/arm64/kernel/signal32.c                  |  23 +--
 arch/blackfin/include/uapi/asm/siginfo.h      |  30 ++-
 arch/frv/include/uapi/asm/siginfo.h           |   2 +-
 arch/ia64/include/uapi/asm/siginfo.h          |  21 +-
 arch/ia64/kernel/signal.c                     |  17 +-
 arch/ia64/kernel/traps.c                      |   4 +-
 arch/mips/include/uapi/asm/siginfo.h          |   4 +-
 arch/mips/kernel/signal32.c                   |  19 +-
 arch/mips/kernel/traps.c                      |   2 +-
 arch/parisc/kernel/signal32.c                 |  31 ++-
 arch/powerpc/kernel/signal_32.c               |  20 +-
 arch/s390/kernel/compat_signal.c              |  32 ++-
 arch/sparc/include/uapi/asm/siginfo.h         |   9 +-
 arch/sparc/kernel/signal32.c                  |  16 +-
 arch/sparc/kernel/traps_32.c                  |   2 +-
 arch/sparc/kernel/traps_64.c                  |   2 +-
 arch/tile/include/uapi/asm/siginfo.h          |   4 +-
 arch/tile/kernel/compat_signal.c              |  18 +-
 arch/tile/kernel/traps.c                      |   2 +-
 arch/x86/kernel/signal_compat.c               |  21 +-
 fs/fcntl.c                                    |  13 +-
 fs/signalfd.c                                 |  22 +-
 fs/xattr.c                                    |   6 +
 include/linux/capability.h                    |   2 +
 include/linux/security.h                      |   2 +
 include/linux/signal.h                        |  22 ++
 include/linux/user_namespace.h                |   9 +-
 include/uapi/asm-generic/siginfo.h            | 115 +++++------
 include/uapi/linux/capability.h               |  22 +-
 kernel/exit.c                                 |   4 +-
 kernel/pid_namespace.c                        |   4 +
 kernel/ptrace.c                               |   6 +-
 kernel/signal.c                               |  72 +++++--
 kernel/sys.c                                  |   8 +-
 kernel/user_namespace.c                       |  20 +-
 security/commoncap.c                          | 277 ++++++++++++++++++++++++--
 tools/testing/selftests/x86/mpx-mini-test.c   |   3 +-
 tools/testing/selftests/x86/protection_keys.c |  13 +-
 40 files changed, 622 insertions(+), 297 deletions(-)

Eric

From rgb at redhat.com Wed Sep 13 17:13:28 2017
From: rgb at redhat.com (Richard Guy Briggs)
Date: Wed, 13 Sep 2017 13:13:28 -0400
Subject: RFC: Audit Kernel Container IDs
Message-ID: <20170913171328.GP3405@madcap2.tricolour.ca>

Containers are a userspace concept. The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions.
Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
trigger signal from the userspace container orchestration system
initiates this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.

The trigger is a pseudo filesystem (proc, since PID tree already exists)
write of a u64 representing the container ID to a file representing a
process that will become the first process in a new container.
This might place restrictions on mount namespaces required to define a
container, or at least careful checking of namespaces in the kernel to
verify permissions of the orchestrator so it can't change its own
container ID.
A bind mount of nsfs may be necessary in the container orchestrator's
mntNS.

Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
filesystem to have this action permitted. At that time, record the
child container's user-supplied 64-bit container identifier along with
the child container's first process (which may become the container's
"init" process) process ID (referenced from the initial PID namespace),
all namespace IDs (in the form of a nsfs device number and inode number
tuple) in a new auxiliary record AUDIT_CONTAINER with a qualifying
op=$action field.

Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's container ID,
referenced in the process' audit_context struct.

Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.

Log the destruction of every namespace when it is no longer used by any
process; include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

A process can be moved from one container to another by using the
container assignment method outlined above a second time.

When a container ceases to exist because the last process in that
container has exited, and hence the last namespace has been destroyed
and its refcount has dropped to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.

A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.

Feedback please.

- RGB

--
Richard Guy Briggs
Sr.
S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 From carlos at redhat.com Wed Sep 13 19:33:52 2017 From: carlos at redhat.com (Carlos O'Donell) Date: Wed, 13 Sep 2017 14:33:52 -0500 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <20170913171328.GP3405@madcap2.tricolour.ca> References: <20170913171328.GP3405@madcap2.tricolour.ca> Message-ID: <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com> On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > Containers are a userspace concept. The kernel knows nothing of them. I am looking at this RFC from a userspace perspective, particularly from the loader's point of view and the unshare syscall and the semantics that arise from the use of it. At a high level what you are doing is providing a way to group, without hierarchy, processes and namespaces. The processes can move between container's if they have CAP_CONTAINER_ADMIN and can open and write to a special proc file. * With unshare a thread may dissociate part of its execution context and therefore see a distinct mount namespace. When you say "process" in this particular RFC do you exclude the fact that a thread might be in a distinct container from the rest of the threads in the process? > The Linux audit system needs a way to be able to track the container > provenance of events and actions. Audit needs the kernel's help to do > this. * Why does the Linux audit system need to tracker container provenance? - How does it help to provide better audit messages? - Is it be enough to list the namespace that a process occupies? * Why does it need the kernel's help? - Is there a race condition that is only fixable with kernel support? - Or is it easier with kernel help but not required? Providing background on these questions would help clarify the design requirements. 
> Since the concept of a container is entirely a userspace concept, a > trigger signal from the userspace container orchestration system > initiates this. This will define a point in time and a set of resources > associated with a particular container with an audit container ID. Please don't use the word 'signal', I suggest 'register' since you are writing to a filesystem. > The trigger is a pseudo filesystem (proc, since PID tree already exists) > write of a u64 representing the container ID to a file representing a > process that will become the first process in a new container. > This might place restrictions on mount namespaces required to define a > container, or at least careful checking of namespaces in the kernel to > verify permissions of the orchestrator so it can't change its own > container ID. > A bind mount of nsfs may be necessary in the container orchestrator's > mntNS. > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > filesystem to have this action permitted. At that time, record the > child container's user-supplied 64-bit container identifier along with What is a "child container?" Containers don't have any hierarchy. I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents your continued operation as we have today? > the child container's first process (which may become the container's > "init" process) process ID (referenced from the initial PID namespace), > all namespace IDs (in the form of a nsfs device number and inode number > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > op=$action field. What kind of requirement is there on the first tid/pid registering the container ID? What if the 8th tid/pid does the registration? Would that mean that the first process of the container did not register? It seems like you are suggesting that the registration by the 8th tid/pid causes a cascading registration progress, registering all tid/pids in the same grouping? Is that true? 
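The assignment and inheritance rules in the quoted RFC can be sketched
as a toy model. All names here (Model, Task, assign) are hypothetical.
The sketch also assumes that a write affects only the target process and
the children it forks afterwards; that is one possible answer to the
questions above, not something the RFC states.

```python
class Task:
    """Toy stand-in for a process; container_id is None until assigned."""

    def __init__(self, pid, container_id=None):
        self.pid = pid
        self.container_id = container_id


class Model:
    """Userspace model of the RFC's container ID assignment rules."""

    def __init__(self):
        self.tasks = {}
        self.log = []  # stand-in for audit records
        self._next_pid = 1

    def spawn(self, parent=None):
        """Fork/clone: the child inherits the parent's container ID."""
        pid = self._next_pid
        self._next_pid += 1
        inherited = parent.container_id if parent else None
        task = Task(pid, inherited)
        self.tasks[pid] = task
        return task

    def assign(self, has_cap_container_admin, task, container_id):
        """Model the proc write: set a u64 container ID on one task."""
        if not has_cap_container_admin:
            raise PermissionError("CAP_CONTAINER_ADMIN required")
        if not 0 <= container_id < 2**64:
            raise ValueError("container ID must fit in a u64")
        # Assumed record shape: (type, pid, old ID, new ID).
        self.log.append(
            ("AUDIT_CONTAINER", task.pid, task.container_id, container_id)
        )
        task.container_id = container_id
```

In this model a second assign() on the same task moves it to another
container, matching the RFC's statement that the assignment method can
be used a second time.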
> Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
>
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the same container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you foresee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable. Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
>
> Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.
>
> A process can be moved from one container to another by using the
> container assignment method outlined above a second time.

OK.

> When a container ceases to exist because the last process in that
> container has exited and hence the last namespace has been destroyed and
> its refcount dropping to zero, log the fact.
> (This latter is likely needed for certification accountability.)
A > container object may need a list of processes and/or namespaces. OK. > A namespace cannot directly migrate from one container to another but > could be assigned to a newly spawned container. A namespace can be > moved from one container to another indirectly by having that namespace > used in a second process in another container and then ending all the > processes in the first container. OK. > Feedback please. -- Cheers, Carlos. From rgb at redhat.com Thu Sep 14 05:30:08 2017 From: rgb at redhat.com (Richard Guy Briggs) Date: Thu, 14 Sep 2017 01:30:08 -0400 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com> References: <20170913171328.GP3405@madcap2.tricolour.ca> <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com> Message-ID: <20170914053007.GR3405@madcap2.tricolour.ca> On 2017-09-13 14:33, Carlos O'Donell wrote: > On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > > Containers are a userspace concept. The kernel knows nothing of them. > > I am looking at this RFC from a userspace perspective, particularly from > the loader's point of view and the unshare syscall and the semantics that > arise from the use of it. > > At a high level what you are doing is providing a way to group, without > hierarchy, processes and namespaces. The processes can move between > container's if they have CAP_CONTAINER_ADMIN and can open and write to > a special proc file. > > * With unshare a thread may dissociate part of its execution context and > therefore see a distinct mount namespace. When you say "process" in this > particular RFC do you exclude the fact that a thread might be in a > distinct container from the rest of the threads in the process? > > > The Linux audit system needs a way to be able to track the container > > provenance of events and actions. Audit needs the kernel's help to do > > this. > > * Why does the Linux audit system need to tracker container provenance? 
- ability to filter unwanted, irrelevant or unimportant messages before they fill queue so important messages don't get lost. This is a certification requirement. - ability to make security claims about containers, require tracking of actions within those containers to ensure compliance with established security policies. - ability to route messages from events to relevant audit daemon instance or host audit daemon instance or both, as required or determined by user-initiated rules > - How does it help to provide better audit messages? > > - Is it be enough to list the namespace that a process occupies? We started with that approach back more than 4 years ago and found it helped, but didn't go far enough in terms of quick and inexpensive record filtering and left some doubt about provenance of events in the case of non-user context events (incoming network packets). > * Why does it need the kernel's help? > > - Is there a race condition that is only fixable with kernel support? This was a concern, but relatively minor compared with the other benefits. > - Or is it easier with kernel help but not required? It is much easier and much less expensive. > Providing background on these questions would help clarify the > design requirements. 
Here are some references that should help provide some background: https://github.com/linux-audit/audit-kernel/issues/32 RFE: add namespace IDs to audit records https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events SPEC Virtualization Manager Guest Lifecycle Events https://lwn.net/Articles/699819/ Audit, namespaces, and containers https://lwn.net/Articles/723561/ Containers as kernel objects (my reply, with references: https://lkml.org/lkml/2017/8/14/15 ) https://bugzilla.redhat.com/show_bug.cgi?id=1045666 audit: add namespace IDs to log records > > Since the concept of a container is entirely a userspace concept, a > > trigger signal from the userspace container orchestration system > > initiates this. This will define a point in time and a set of resources > > associated with a particular container with an audit container ID. > > Please don't use the word 'signal', I suggest 'register' since you are > writing to a filesystem. Ok, that's a very reasonable request. 'signal' has a previous meaning. > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > write of a u64 representing the container ID to a file representing a > > process that will become the first process in a new container. > > This might place restrictions on mount namespaces required to define a > > container, or at least careful checking of namespaces in the kernel to > > verify permissions of the orchestrator so it can't change its own > > container ID. > > A bind mount of nsfs may be necessary in the container orchestrator's > > mntNS. > > > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > > filesystem to have this action permitted. At that time, record the > > child container's user-supplied 64-bit container identifier along with > > What is a "child container?" Containers don't have any hierarchy. 
Maybe some don't, but that's not likely to last long given the abstraction and nesting of orchestration tools. This must be nestable. > I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents > your continued operation as we have today? Correct. It won't prevent processes that otherwise have permissions today from creating all the namespaces it wishes. > > the child container's first process (which may become the container's > > "init" process) process ID (referenced from the initial PID namespace), > > all namespace IDs (in the form of a nsfs device number and inode number > > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > > op=$action field. > > What kind of requirement is there on the first tid/pid registering > the container ID? What if the 8th tid/pid does the registration? > Would that mean that the first process of the container did not > register? It seems like you are suggesting that the registration > by the 8th tid/pid causes a cascading registration progress, > registering all tid/pids in the same grouping? Is that true? Ah, good question, I forgot to address that fact. The intent is that either threaded processes after initiating threading will not have permission to execute this, or all the processes in the thread group will be forced into the same container. I don't have a strong opinion on whether or not it must be the lead thread process that must be the one to receive that registration, but I suspect that would be wise. > > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > > container ID present on an auditable action or event. > > > > Forked and cloned processes inherit their parent's container ID, > > referenced in the process' audit_context struct. > > So a cloned process with CLONE_NEWNS has the came container ID > as the parent process that called clone, at least until the clone > has time to change to a new container ID? Yes. 
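
A rough illustration of the namespace IDs referred to above (the nsfs
device number and inode number tuple): the Python sketch below is not
from the thread and only reads those tuples for the current process.
It assumes a Linux host with /proc mounted; the function name
namespace_ids is made up for this example.

```python
import os

NS_DIR = "/proc/self/ns"  # assumption: Linux with procfs mounted


def namespace_ids(ns_dir=NS_DIR):
    """Return {namespace_name: (device_number, inode_number)}.

    namespace_ids is a hypothetical helper for illustration; it is not
    part of the audit proposal or of any kernel interface.
    """
    ids = {}
    if not os.path.isdir(ns_dir):
        return ids  # not Linux, or /proc is unavailable
    for name in sorted(os.listdir(ns_dir)):
        try:
            # os.stat() follows the link and reports the nsfs inode
            st = os.stat(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not accessible
        ids[name] = (st.st_dev, st.st_ino)
    return ids


if __name__ == "__main__":
    for name, (dev, ino) in namespace_ids().items():
        print(f"{name}: dev={dev} ino={ino}")
```

Two processes are in the same namespace when both numbers of the tuple
match, which is why the tuple can serve as a namespace identifier in a
log record.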
> Do you forsee any case where someone might need a semantic that is > slightly different? For example wanting to set the container ID on > clone? I could envision that situation and I think that might be workable but for the synchronicity of having one initiated by a specific syscall and the other initiated by a /proc write. > > Log the creation of every namespace, inheriting/adding its spawning > > process' containerID(s), if applicable. Include the spawning and > > spawned namespace IDs (device and inode number tuples). > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > > Note: At this point it appears only network namespaces may need to track > > container IDs apart from processes since incoming packets may cause an > > auditable event before being associated with a process. > > OK. > > > Log the destruction of every namespace when it is no longer used by any > > process, include the namespace IDs (device and inode number tuples). > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > > the parent and child namespace IDs for any changes to a process' > > namespaces. [setns(2)] > > Note: It may be possible to combine AUDIT_NS_* record formats and > > distinguish them with an op=$action field depending on the fields > > required for each message type. > > > > A process can be moved from one container to another by using the > > container assignment method outlined above a second time. > > OK. > > > When a container ceases to exist because the last process in that > > container has exited and hence the last namespace has been destroyed and > > its refcount dropping to zero, log the fact. > > (This latter is likely needed for certification accountability.) A > > container object may need a list of processes and/or namespaces. > > OK. 
> > > A namespace cannot directly migrate from one container to another but > > could be assigned to a newly spawned container. A namespace can be > > moved from one container to another indirectly by having that namespace > > used in a second process in another container and then ending all the > > processes in the first container. > > OK. > > > Feedback please. Thank you sir! > Carlos. - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 From rgb at redhat.com Thu Sep 14 05:47:45 2017 From: rgb at redhat.com (Richard Guy Briggs) Date: Thu, 14 Sep 2017 01:47:45 -0400 Subject: [PATCH 2/9] Implement containers as kernel objects In-Reply-To: <20170906140341.GA8729@mail.hallyn.com> References: <149547014649.10599.12025037906646164347.stgit@warthog.procyon.org.uk> <149547016213.10599.1969443294414531853.stgit@warthog.procyon.org.uk> <20170814054711.GB29957@madcap2.tricolour.ca> <20170818080300.GQ7187@madcap2.tricolour.ca> <20170906140341.GA8729@mail.hallyn.com> Message-ID: <20170914054745.GS3405@madcap2.tricolour.ca> On 2017-09-06 09:03, Serge E. Hallyn wrote: > Quoting Richard Guy Briggs (rgb at redhat.com): > ... > > > I believe we are going to need a container ID to container definition > > > (namespace, etc.) mapping mechanism regardless of if the container ID > > > is provided by userspace or a kernel generated serial number. This > > > mapping should be recorded in the audit log when the container ID is > > > created/defined. > > > > Agreed. > > > > > > As was suggested in one of the previous threads, if there are any events not > > > > associated with a task (incoming network packets) we log the namespace ID and > > > > then only concern ourselves with its container serial number or container name > > > > once it becomes associated with a task at which point that tracking will be > > > > more important anyways. > > > > > > Agreed. 
After all, a single namespace can be shared between multiple > > > containers. For those security officers who need to track individual > > > events like this they will have the container ID mapping information > > > in the logs as well so they should be able to trace the unassociated > > > event to a set of containers. > > > > > > > I'm not convinced that a userspace or kernel generated UUID is that useful > > > > since they are large, not human readable and may not be globally unique given > > > > the "pets vs cattle" direction we are going with potentially identical > > > > conditions in hosts or containers spawning containers, but I see no need to > > > > restrict them. > > > > > > From a kernel perspective I think an int should suffice; after all, > > > you can't have more containers then you have processes. If the > > > container engine requires something more complex, it can use the int > > > as input to its own mapping function. > > > > PIDs roll over. That already causes some ambiguity in reporting. If a > > system is constantly spawning and reaping containers, especially > > single-process containers, I don't want to have to worry about that ID > > rolling to keep track of it even though there should be audit records of > > the spawn and death of each container. There isn't significant cost > > added here compared with some of the other overhead we're dealing with. > > Strawman proposal: > > 1. Each clone/unshare/setns involving a namespace type generates an audit > message along the lines of: > > PID 9512 (pid in init_pid_ns) in auditnsid 00000001 cloned CLONE_NEWNS|CLONE_NEWNET > new auditnsid: 00000002 > associated namespaces: (list of all namespace filesystem inode numbers) As you will have seen, this is pretty much what my most recent proposal suggests. > 2. Userspace (i.e. the container logging deamon here) can watch the audit log > for all messages relating to auditnsid 00000002. 
Presumably there will be
> messages along the lines of "PID 9513 in auditnsid 00000002 cloned...". The
> container logging daemon can track those messages and add the new auditnsids
> to the list it watches.

Yes.

> 3. If a container is migrated (checkpointed and restored here or elsewhere),
> userspace can just follow the appropriate logs for the new containers.

Yes.

> Userspace does not ever *request* an auditnsid. They are ephemeral, just a
> tool to track the namespaces through the audit log. They are however guaranteed
> to never be re-used until reboot.

Well, this is where things get controversial... I had wanted this, a
kernel-generated serial number unique to a running kernel to track every
container initiation, but this does have some CRIU challenges pointed
out by Eric Biederman. Nested containers will not have a consistent
view on a new host and no way to make it consistent. If we could
guarantee that containers would never be nested, this could be workable.
I think nesting is inevitable in the future given the variety and
creativity of the orchestration tools, so restricting this seems
short-sighted.

At the moment the approach is to let the orchestrator determine the ID of
a container. Some have argued for as small as u32 and others for a full
UUID. A u32 runs the risk of rolling, so a u64 seems like a reasonable
step to solve that issue. Others would like to be able to store a full
UUID, which seemed like a good idea at the outset, but on further
thinking, this is something the orchestrator can manage while minimising
the number of bits of required information per audit record to guarantee
we can identify the provenance of a particular audit event. Let's see
if we can make it work with a u64.

> (Feels like someone must have proposed this before)

These ideas have been thrown around a few times and I'm starting to
understand them better.

> -serge

- RGB

--
Richard Guy Briggs
Sr.
S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 From ebiederm at xmission.com Thu Sep 14 17:33:06 2017 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 14 Sep 2017 12:33:06 -0500 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <20170913171328.GP3405@madcap2.tricolour.ca> (Richard Guy Briggs's message of "Wed, 13 Sep 2017 13:13:28 -0400") References: <20170913171328.GP3405@madcap2.tricolour.ca> Message-ID: <87d16tb2y5.fsf@xmission.com> Richard Guy Briggs writes: > The trigger is a pseudo filesystem (proc, since PID tree already exists) > write of a u64 representing the container ID to a file representing a > process that will become the first process in a new container. > This might place restrictions on mount namespaces required to define a > container, or at least careful checking of namespaces in the kernel to > verify permissions of the orchestrator so it can't change its own > container ID. Why a u64? Why a proc filesystem write and not a magic audit message? I don't like the fact that the proc filesystem entry is likely going to be readable and abusable by non-audit contexts? Why the ability to change the containerid? What is the use case you are thinking of there? Eric From rgb at redhat.com Thu Sep 14 18:07:04 2017 From: rgb at redhat.com (Richard Guy Briggs) Date: Thu, 14 Sep 2017 14:07:04 -0400 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <87d16tb2y5.fsf@xmission.com> References: <20170913171328.GP3405@madcap2.tricolour.ca> <87d16tb2y5.fsf@xmission.com> Message-ID: <20170914180704.GU3405@madcap2.tricolour.ca> On 2017-09-14 12:33, Eric W. Biederman wrote: > Richard Guy Briggs writes: > > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > write of a u64 representing the container ID to a file representing a > > process that will become the first process in a new container. 
> > This might place restrictions on mount namespaces required to define a > > container, or at least careful checking of namespaces in the kernel to > > verify permissions of the orchestrator so it can't change its own > > container ID. > > Why a u64? u32 will roll too quickly. UUID is large enough that it adds significantly to audit record bandwidth. I'd prefer u64, but can look at the difference of accommodating a UUID... > Why a proc filesystem write and not a magic audit message? A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use sparingly. Given that orchestrators will already require it to send the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. I was originally leaning towards an audit message trigger or a syscall. > I don't like the fact that the proc filesystem entry is likely going to > be readable and abusable by non-audit contexts? This proposal wasn't going to start with that link being readable, but its filesystem structure and link names would be, perhaps giving away too much already. I think we will need to find a way for the orchestrator or one of its authorized agents to read this information while blocking reads from unauthorized agents, otherwise this would be of very limited use. > Why the ability to change the containerid? What is the use case you are > thinking of there? This was covered in the end of the conversation with Paul Moore (that maybe you got tired reading?) I'd originally proposed having it write once, but Paul figured there was no good reason to restrict it and leave that decision up to the orchestrator. The use case would be adding other processes to a container, but it could be argued all additional processes should be spawned by the first process in a container. > Eric - RGB -- Richard Guy Briggs Sr. 
S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 From rgb at redhat.com Fri Sep 15 10:19:11 2017 From: rgb at redhat.com (Richard Guy Briggs) Date: Fri, 15 Sep 2017 06:19:11 -0400 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <20170914053007.GR3405@madcap2.tricolour.ca> References: <20170913171328.GP3405@madcap2.tricolour.ca> <9043cc5a-e624-10c9-1906-f29010c5f57c@redhat.com> <20170914053007.GR3405@madcap2.tricolour.ca> Message-ID: <20170915101911.GA21172@madcap2.tricolour.ca> On 2017-09-14 01:30, Richard Guy Briggs wrote: > On 2017-09-13 14:33, Carlos O'Donell wrote: > > On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > > > Containers are a userspace concept. The kernel knows nothing of them. > > > > I am looking at this RFC from a userspace perspective, particularly from > > the loader's point of view and the unshare syscall and the semantics that > > arise from the use of it. > > > > At a high level what you are doing is providing a way to group, without > > hierarchy, processes and namespaces. The processes can move between > > container's if they have CAP_CONTAINER_ADMIN and can open and write to > > a special proc file. I should clarify: It wasn't intended that a process can see or modify its own or a peer's special proc container file to be able to set it or discover its value. This was only meant for its orchestrator or delegated agents to do. This can't be left only to CAP_CONTAINER_ADMIN. This may require a container to have its own mount namespace if the trigger mechanism is a proc file write. Other methods (additional namespaces?) may be needed to restrict it for other trigger methods (syscall?). > > * With unshare a thread may dissociate part of its execution context and > > therefore see a distinct mount namespace. 
When you say "process" in this > > particular RFC do you exclude the fact that a thread might be in a > > distinct container from the rest of the threads in the process? > > > > > The Linux audit system needs a way to be able to track the container > > > provenance of events and actions. Audit needs the kernel's help to do > > > this. > > > > * Why does the Linux audit system need to tracker container provenance? > > - ability to filter unwanted, irrelevant or unimportant messages before > they fill queue so important messages don't get lost. This is a > certification requirement. > > - ability to make security claims about containers, require tracking of > actions within those containers to ensure compliance with established > security policies. > > - ability to route messages from events to relevant audit daemon > instance or host audit daemon instance or both, as required or > determined by user-initiated rules > > > - How does it help to provide better audit messages? > > > > - Is it be enough to list the namespace that a process occupies? > > We started with that approach back more than 4 years ago and found it > helped, but didn't go far enough in terms of quick and inexpensive > record filtering and left some doubt about provenance of events in the > case of non-user context events (incoming network packets). > > > * Why does it need the kernel's help? > > > > - Is there a race condition that is only fixable with kernel support? > > This was a concern, but relatively minor compared with the other benefits. > > > - Or is it easier with kernel help but not required? > > It is much easier and much less expensive. > > > Providing background on these questions would help clarify the > > design requirements. 
> > Here are some references that should help provide some background: > https://github.com/linux-audit/audit-kernel/issues/32 > RFE: add namespace IDs to audit records > > https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events > SPEC Virtualization Manager Guest Lifecycle Events > > https://lwn.net/Articles/699819/ > Audit, namespaces, and containers > > https://lwn.net/Articles/723561/ > Containers as kernel objects > (my reply, with references: https://lkml.org/lkml/2017/8/14/15 ) > > https://bugzilla.redhat.com/show_bug.cgi?id=1045666 > audit: add namespace IDs to log records > > > > Since the concept of a container is entirely a userspace concept, a > > > trigger signal from the userspace container orchestration system > > > initiates this. This will define a point in time and a set of resources > > > associated with a particular container with an audit container ID. > > > > Please don't use the word 'signal', I suggest 'register' since you are > > writing to a filesystem. > > Ok, that's a very reasonable request. 'signal' has a previous meaning. > > > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > > write of a u64 representing the container ID to a file representing a > > > process that will become the first process in a new container. > > > This might place restrictions on mount namespaces required to define a > > > container, or at least careful checking of namespaces in the kernel to > > > verify permissions of the orchestrator so it can't change its own > > > container ID. > > > A bind mount of nsfs may be necessary in the container orchestrator's > > > mntNS. > > > > > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > > > filesystem to have this action permitted. At that time, record the > > > child container's user-supplied 64-bit container identifier along with > > > > What is a "child container?" Containers don't have any hierarchy. 
> > Maybe some don't, but that's not likely to last long given the > abstraction and nesting of orchestration tools. This must be nestable. This is why we can't rely only on CAP_CONTAINER_ADMIN to restrict the ability for self-modification or self-discovery. > > I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents > > your continued operation as we have today? > > Correct. It won't prevent processes that otherwise have permissions > today from creating all the namespaces it wishes. > > > > the child container's first process (which may become the container's > > > "init" process) process ID (referenced from the initial PID namespace), > > > all namespace IDs (in the form of a nsfs device number and inode number > > > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > > > op=$action field. > > > > What kind of requirement is there on the first tid/pid registering > > the container ID? What if the 8th tid/pid does the registration? > > Would that mean that the first process of the container did not > > register? It seems like you are suggesting that the registration > > by the 8th tid/pid causes a cascading registration progress, > > registering all tid/pids in the same grouping? Is that true? > > Ah, good question, I forgot to address that fact. The intent is that > either threaded processes after initiating threading will not have > permission to execute this, or all the processes in the thread group > will be forced into the same container. I don't have a strong opinion > on whether or not it must be the lead thread process that must be the > one to receive that registration, but I suspect that would be wise. > > > > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > > > container ID present on an auditable action or event. > > > > > > Forked and cloned processes inherit their parent's container ID, > > > referenced in the process' audit_context struct. 
> > > > So a cloned process with CLONE_NEWNS has the came container ID > > as the parent process that called clone, at least until the clone > > has time to change to a new container ID? > > Yes. And as pointed to above, it isn't the process itself that is able to change to a new container, but its orchestrator to move/assign it. > > Do you forsee any case where someone might need a semantic that is > > slightly different? For example wanting to set the container ID on > > clone? > > I could envision that situation and I think that might be workable but > for the synchronicity of having one initiated by a specific syscall and > the other initiated by a /proc write. The ability to clone while providing a containerID would work really well, but I'm hesitant to extend or duplicate the clone call. This actually sounds like a potentially sane way of approaching it. > > > Log the creation of every namespace, inheriting/adding its spawning > > > process' containerID(s), if applicable. Include the spawning and > > > spawned namespace IDs (device and inode number tuples). > > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > > > Note: At this point it appears only network namespaces may need to track > > > container IDs apart from processes since incoming packets may cause an > > > auditable event before being associated with a process. > > > > OK. > > > > > Log the destruction of every namespace when it is no longer used by any > > > process, include the namespace IDs (device and inode number tuples). > > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > > > > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > > > the parent and child namespace IDs for any changes to a process' > > > namespaces. [setns(2)] > > > Note: It may be possible to combine AUDIT_NS_* record formats and > > > distinguish them with an op=$action field depending on the fields > > > required for each message type. 
> > > > > > A process can be moved from one container to another by using the > > > container assignment method outlined above a second time. > > > > OK. > > > > > When a container ceases to exist because the last process in that > > > container has exited and hence the last namespace has been destroyed and > > > its refcount dropping to zero, log the fact. > > > (This latter is likely needed for certification accountability.) A > > > container object may need a list of processes and/or namespaces. > > > > OK. > > > > > A namespace cannot directly migrate from one container to another but > > > could be assigned to a newly spawned container. A namespace can be > > > moved from one container to another indirectly by having that namespace > > > used in a second process in another container and then ending all the > > > processes in the first container. > > > > OK. > > > > > Feedback please. > > Thank you sir! > > > Carlos. > > - RGB > > -- > Richard Guy Briggs > Sr. S/W Engineer, Kernel Security, Base Operating Systems > Remote, Ottawa, Red Hat Canada > IRC: rgb, SunRaycer > Voice: +1.647.777.2635, Internal: (81) 32635 > > -- > Linux-audit mailing list > Linux-audit at redhat.com > https://www.redhat.com/mailman/listinfo/linux-audit - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 From ebiederm at xmission.com Tue Sep 19 02:45:19 2017 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 18 Sep 2017 21:45:19 -0500 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <20170914180704.GU3405@madcap2.tricolour.ca> (Richard Guy Briggs's message of "Thu, 14 Sep 2017 14:07:04 -0400") References: <20170913171328.GP3405@madcap2.tricolour.ca> <87d16tb2y5.fsf@xmission.com> <20170914180704.GU3405@madcap2.tricolour.ca> Message-ID: <87wp4v76f4.fsf@xmission.com> Richard Guy Briggs writes: > On 2017-09-14 12:33, Eric W. 
Biederman wrote: >> Richard Guy Briggs writes: >> >> > The trigger is a pseudo filesystem (proc, since PID tree already exists) >> > write of a u64 representing the container ID to a file representing a >> > process that will become the first process in a new container. >> > This might place restrictions on mount namespaces required to define a >> > container, or at least careful checking of namespaces in the kernel to >> > verify permissions of the orchestrator so it can't change its own >> > container ID. >> >> Why a u64? > > u32 will roll too quickly. UUID is large enough that it adds > significantly to audit record bandwidth. I'd prefer u64, but can look > at the difference of accommodating a UUID... I was imagining a string might be better. As for the purposes of audit it is just a byte string you regurgitate. >> Why a proc filesystem write and not a magic audit message? > > A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use > sparingly. Given that orchestrators will already require it to send > the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. > > I was originally leaning towards an audit message trigger or a syscall. > >> I don't like the fact that the proc filesystem entry is likely going to >> be readable and abusable by non-audit contexts? > > This proposal wasn't going to start with that link being readable, but > its filesystem structure and link names would be, perhaps giving away > too much already. > > I think we will need to find a way for the orchestrator or one of its > authorized agents to read this information while blocking reads from > unauthorized agents, otherwise this would be of very limited use. Something that is set only for future audit messages seems reasonable. Once you start reading this from something other than audit messages I get nervous that people will use this beyond audit for things it is not intended for. >> Why the ability to change the containerid?
What is the use case you are >> thinking of there? > > This was covered in the end of the conversation with Paul Moore (that > maybe you got tired reading?) I have not had time to review everything. As I was busy preparing for my wedding and am now in the middle of my honeymoon. > I'd originally proposed having it write > once, but Paul figured there was no good reason to restrict it and leave > that decision up to the orchestrator. The use case would be adding > other processes to a container, but it could be argued all additional > processes should be spawned by the first process in a container. I see two cases here: a) Nested containers b) Inject processes via something like nsenter into a container. In case a) you have to figure out what to do with nested containers and that does seem to be a legitimate case for a double write. Arguably with the restriction that you must specify a more nested label. In case b) which you seem to be referring to it would be a process created by the container manager outside the container that has no container label. At which point there is not a need for a double write. So my recommendation is to not support double writes until you support nested containers. Eric From rgb at redhat.com Tue Sep 19 04:15:05 2017 From: rgb at redhat.com (Richard Guy Briggs) Date: Tue, 19 Sep 2017 00:15:05 -0400 Subject: RFC: Audit Kernel Container IDs In-Reply-To: <87wp4v76f4.fsf@xmission.com> References: <20170913171328.GP3405@madcap2.tricolour.ca> <87d16tb2y5.fsf@xmission.com> <20170914180704.GU3405@madcap2.tricolour.ca> <87wp4v76f4.fsf@xmission.com> Message-ID: <20170919041505.GQ3405@madcap2.tricolour.ca> On 2017-09-18 21:45, Eric W. Biederman wrote: > Richard Guy Briggs writes: > > > On 2017-09-14 12:33, Eric W. 
Biederman wrote: > >> Richard Guy Briggs writes: > >> > >> > The trigger is a pseudo filesystem (proc, since PID tree already exists) > >> > write of a u64 representing the container ID to a file representing a > >> > process that will become the first process in a new container. > >> > This might place restrictions on mount namespaces required to define a > >> > container, or at least careful checking of namespaces in the kernel to > >> > verify permissions of the orchestrator so it can't change its own > >> > container ID. > >> > >> Why a u64? > > > > u32 will roll too quickly. UUID is large enough that it adds > > significantly to audit record bandwidth. I'd prefer u64, but can look > > at the difference of accommodating a UUID... > > I was imagining a string might be better. As for the purposes of audit > it is just a byte string you regurgitate. Yes, so looking at u128 vs dhowells' proposal, it would be 16 bytes vs 24 bytes, which really isn't that much difference... What length of string length were you envisioning? > >> Why a proc filesystem write and not a magic audit message? > > > > A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use > > sparingly. Given that orchestrators will already require it to send > > the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. > > > > I was originally leaning towards an audit message trigger or a syscall. > > > >> I don't like the fact that the proc filesystem entry is likely going to > >> be readable and abusable by non-audit contexts? > > > > This proposal wasn't going to start with that link being readable, but > > its filesystem structure and link names would be, perhaps giving away > > too much already. > > > > I think we will need to find a way for the orchestrator or one of its > > authorized agents to read this information while blocking reads from > > unauthorized agents, otherwise this would be of very limited use. 
> > Something that is set only for future audit messages seems reasonable. > Once you start reading this from something other than audit messages I > get nervous that people will use this beyond audit for things it is > not intended for. Understandably. At the same time, if we implement something that is more broadly useful and solves a number of other challenges others are facing, how can we make it available while limiting the potential for abuse? > >> Why the ability to change the containerid? What is the use case you are > >> thinking of there? > > > > This was covered in the end of the conversation with Paul Moore (that > > maybe you got tired reading?) > > I have not had time to review everything. As I was busy preparing for my > wedding and am now in the middle of my honeymoon. I'm very sorry, my bad! You had given me a heads up about this and I apologise for causing a stir during your special time. > > I'd originally proposed having it write > > once, but Paul figured there was no good reason to restrict it and leave > > that decision up to the orchestrator. The use case would be adding > > other processes to a container, but it could be argued all additional > > processes should be spawned by the first process in a container. > > I see two cases here: > a) Nested containers > b) Inject processes via something like nsenter into a container. > > In case a) you have to figure out what to do with nested containers > and that does seem to be a legitimate case for a double write. Arguably > with the restriction that you must specify a more nested label. Is this technically a double write if it is an inheritance? That should be solvable with a flag. > In case b) which you seem to be referring to it would be a process > created by the container manager outside the container that has no > container label. At which point there is not a need for a double write.
Looking at the potential for nesting, if the orchestrator is already in a container, then it would already have a label, but if we refer to the flag solution above, then it is still the first write. > So my recommendation is to not support double writes until you support > nested containers. I think this is a reasonable restriction. Thanks for your time. Sorry to disturb your holiday. > Eric - RGB -- Richard Guy Briggs Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
From ebiederm at xmission.com Thu Sep 28 22:34:53 2017 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 28 Sep 2017 17:34:53 -0500 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave Message-ID: <87tvzmqwoi.fsf@xmission.com> It looks like once upon a time a long time ago selinux copied code from cap_inode_removexattr and cap_inode_setxattr into selinux_inode_setotherxattr. However the code has now diverged and selinux is implementing a policy that is quite different than cap_inode_setxattr and cap_inode_removexattr especially when it comes to the security.capable xattr.
To keep things working and to make the comments in security/security.c correct when the xattr is security.capable, call cap_inode_setxattr or cap_inode_removexattr as appropriate. I suspect there is a larger conversation to be had here but this is enough to keep selinux from implementing a non-sense hard coded policy that breaks other parts of the kernel. Signed-off-by: "Eric W. Biederman" --- security/selinux/hooks.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index f5d304736852..edf4bd292dc7 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name, u32 newsid, sid = current_sid(); int rc = 0; + if (strcmp(name, XATTR_NAME_CAPS) == 0) + return cap_inode_setxattr(dentry, name, value, size, flags); + if (strcmp(name, XATTR_NAME_SELINUX)) return selinux_inode_setotherxattr(dentry, name); @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry) static int selinux_inode_removexattr(struct dentry *dentry, const char *name) { + if (strcmp(name, XATTR_NAME_CAPS) == 0) + return cap_inode_removexattr(dentry, name); + if (strcmp(name, XATTR_NAME_SELINUX)) return selinux_inode_setotherxattr(dentry, name); -- 2.14.1 From casey at schaufler-ca.com Fri Sep 29 01:16:06 2017 From: casey at schaufler-ca.com (Casey Schaufler) Date: Thu, 28 Sep 2017 18:16:06 -0700 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <87tvzmqwoi.fsf@xmission.com> References: <87tvzmqwoi.fsf@xmission.com> Message-ID: <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> On 9/28/2017 3:34 PM, Eric W. Biederman wrote: > It looks like once upon a time a long time ago selinux copied code > from cap_inode_removexattr and cap_inode_setxattr into > selinux_inode_setotherxattr.
However the code has now diverged and > selinux is implementing a policy that is quite different than > cap_inode_setxattr and cap_inode_removexattr especially when it comes > to the security.capable xattr. What leads you to believe that this isn't intentional? It's most likely the case that this change occurred as part of the first round module stacking change. What behavior do you see that you're unhappy with? > > To keep things working Which "things"? How are they not "working"? > and to make the comments in security/security.c > correct when the xattr is securit.capable, call cap_inode_setxattr > or cap_inode_removexattr as appropriate. > > I suspect there is a larger conversation to be had here but this > is enough to keep selinux from implementing a non-sense hard coded > policy that breaks other parts of the kernel. Specifics, please. Since I can't guess what problem you've encountered I can't tell if it's here, in the infrastructure, or in your perception of what constitutes "broken". > > Signed-off-by: "Eric W. Biederman" > --- > security/selinux/hooks.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > index f5d304736852..edf4bd292dc7 100644 > --- a/security/selinux/hooks.c > +++ b/security/selinux/hooks.c > @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name, > u32 newsid, sid = current_sid(); > int rc = 0; > > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > + return cap_inode_setxattr(dentry, name, value, size, flags); > + No. Don't even think of contemplating considering embedding the cap attribute check in the SELinux code. cap_inode_setxattr() is called in the infrastructure. 
> if (strcmp(name, XATTR_NAME_SELINUX)) > return selinux_inode_setotherxattr(dentry, name); > > @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct dentry *dentry) > > static int selinux_inode_removexattr(struct dentry *dentry, const char *name) > { > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > + return cap_inode_removexattr(dentry, name); > + > if (strcmp(name, XATTR_NAME_SELINUX)) > return selinux_inode_setotherxattr(dentry, name); > . From sds at tycho.nsa.gov Fri Sep 29 12:36:41 2017 From: sds at tycho.nsa.gov (Stephen Smalley) Date: Fri, 29 Sep 2017 08:36:41 -0400 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <87tvzmqwoi.fsf@xmission.com> References: <87tvzmqwoi.fsf@xmission.com> Message-ID: <1506688601.5571.1.camel@tycho.nsa.gov> On Thu, 2017-09-28 at 17:34 -0500, Eric W. Biederman wrote: > It looks like once upon a time a long time ago selinux copied code > from cap_inode_removexattr and cap_inode_setxattr into > selinux_inode_setotherxattr. However the code has now diverged and > selinux is implementing a policy that is quite different than > cap_inode_setxattr and cap_inode_removexattr especially when it comes > to the security.capable xattr. > > To keep things working and to make the comments in > security/security.c > correct when the xattr is security.capable, call cap_inode_setxattr > or cap_inode_removexattr as appropriate. > > I suspect there is a larger conversation to be had here but this > is enough to keep selinux from implementing a non-sense hard coded > policy that breaks other parts of the kernel. Originally SELinux called the cap functions directly since there was no stacking support in the infrastructure and one had to manually stack a secondary module internally.
inode_setxattr and inode_removexattr however were special cases because the cap functions would check CAP_SYS_ADMIN for any non-capability attributes in the security.* namespace, and we don't want to impose that requirement on setting security.selinux. Thus, we inlined the capabilities logic into the selinux hook functions and adapted it appropriately. When the stacking support was introduced, it had to also special case these hooks so that only the primary module's hook is used for the same reason; otherwise, the kernel would end up applying a CAP_SYS_ADMIN check on setting security.selinux. Your change below is almost but not quite right since it only calls the cap functions when setting the capability attribute; the residual problem is that it will then skip the SELinux FILE__SETATTR (file setattr) permission check when setting those attributes, which we want to retain. So you need to only return early if cap_inode_setxattr()/removexattr() return an error; otherwise, you need to proceed to the SELinux check, and you can then delete the duplicated logic from selinux_inode_setotherxattr(). At which point it just becomes a call to dentry_has_perm() and you can just inline that into selinux_inode_setxattr() and selinux_inode_removexattr(). > > Signed-off-by: "Eric W. Biederman" > --- > security/selinux/hooks.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > index f5d304736852..edf4bd292dc7 100644 > --- a/security/selinux/hooks.c > +++ b/security/selinux/hooks.c > @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct dentry > *dentry, const char *name, > u32 newsid, sid = current_sid(); > int rc = 0; > > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > + return cap_inode_setxattr(dentry, name, value, size, > flags); > + > if (strcmp(name, XATTR_NAME_SELINUX)) > return selinux_inode_setotherxattr(dentry, name); >
> @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct > dentry *dentry) > ? > ?static int selinux_inode_removexattr(struct dentry *dentry, const > char *name) > ?{ > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > + return cap_inode_removexattr(dentry, name); > + > ? if (strcmp(name, XATTR_NAME_SELINUX)) > ? return selinux_inode_setotherxattr(dentry, name); > ? From sds at tycho.nsa.gov Fri Sep 29 14:18:57 2017 From: sds at tycho.nsa.gov (Stephen Smalley) Date: Fri, 29 Sep 2017 10:18:57 -0400 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> References: <87tvzmqwoi.fsf@xmission.com> <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> Message-ID: <1506694737.5571.9.camel@tycho.nsa.gov> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote: > On 9/28/2017 3:34 PM, Eric W. Biederman wrote: > > It looks like once upon a time a long time ago selinux copied code > > from cap_inode_removexattr and cap_inode_setxattr into > > selinux_inode_setotherxattr.??However the code has now diverged and > > selinux is implementing a policy that is quite different than > > cap_inode_setxattr and cap_inode_removexattr especially when it > > comes > > to the security.capable xattr. > > What leads you to believe that this isn't intentional? > It's most likely the case that this change occurred as > part of the first round module stacking change. What behavior > do you see that you're unhappy with? > > > > > To keep things working > > Which "things"? How are they not "working"? > > > ?and to make the comments in security/security.c > > correct when the xattr is securit.capable, call cap_inode_setxattr > > or cap_inode_removexattr as appropriate. > > > > I suspect there is a larger conversation to be had here but this > > is enough to keep selinux from implementing a non-sense hard coded > > policy that breaks other parts of the kernel. > > Specifics, please. 
Since I can't guess what problem you've > encountered I can't tell if it's here, in the infrastructure, > or in your perception of what constitutes "broken". > > > > > Signed-off-by: "Eric W. Biederman" > > --- > > ?security/selinux/hooks.c | 6 ++++++ > > ?1 file changed, 6 insertions(+) > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > index f5d304736852..edf4bd292dc7 100644 > > --- a/security/selinux/hooks.c > > +++ b/security/selinux/hooks.c > > @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct > > dentry *dentry, const char *name, > > ? u32 newsid, sid = current_sid(); > > ? int rc = 0; > > ? > > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > > + return cap_inode_setxattr(dentry, name, value, > > size, flags); > > + > > No. Don't even think of contemplating considering embedding the cap > attribute check in the SELinux code. cap_inode_setxattr() is called > in > the infrastructure. Except that it isn't, not if any other security module is enabled and implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when setting security.selinux or security.SMACK*. An alternative approach to fixing this would be to change the cap functions to only apply their checks if setting the capability attribute and defer any checks on other security.* attributes to either the security framework or the other security modules. Then the framework could always call all the modules on the inode_setxattr and inode_removexattr hooks as with other hooks. The security framework would then need to ensure that a check is still applied when setting security.* attributes if it isn't already handled by one of the enabled security modules, as you don't want unprivileged userspace to be able to set arbitrary security.foo attributes or to set up security.selinux or security.SMACK* attributes if those modules happen to be disabled. > ? > > > ? if (strcmp(name, XATTR_NAME_SELINUX)) > > ? return selinux_inode_setotherxattr(dentry, name); > > ? 
> > @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct > > dentry *dentry) > > ? > > ?static int selinux_inode_removexattr(struct dentry *dentry, const > > char *name) > > ?{ > > + if (strcmp(name, XATTR_NAME_CAPS) == 0) > > + return cap_inode_removexattr(dentry, name); > > + > > ? if (strcmp(name, XATTR_NAME_SELINUX)) > > ? return selinux_inode_setotherxattr(dentry, name); > > ? > > > . From casey at schaufler-ca.com Fri Sep 29 15:46:21 2017 From: casey at schaufler-ca.com (Casey Schaufler) Date: Fri, 29 Sep 2017 08:46:21 -0700 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <1506694737.5571.9.camel@tycho.nsa.gov> References: <87tvzmqwoi.fsf@xmission.com> <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> <1506694737.5571.9.camel@tycho.nsa.gov> Message-ID: <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> On 9/29/2017 7:18 AM, Stephen Smalley wrote: > On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote: >> On 9/28/2017 3:34 PM, Eric W. Biederman wrote: >>> It looks like once upon a time a long time ago selinux copied code >>> from cap_inode_removexattr and cap_inode_setxattr into >>> selinux_inode_setotherxattr.??However the code has now diverged and >>> selinux is implementing a policy that is quite different than >>> cap_inode_setxattr and cap_inode_removexattr especially when it >>> comes >>> to the security.capable xattr. >> What leads you to believe that this isn't intentional? >> It's most likely the case that this change occurred as >> part of the first round module stacking change. What behavior >> do you see that you're unhappy with? >> >>> To keep things working >> Which "things"? How are they not "working"? >> >>> ?and to make the comments in security/security.c >>> correct when the xattr is securit.capable, call cap_inode_setxattr >>> or cap_inode_removexattr as appropriate. 
>>> >>> I suspect there is a larger conversation to be had here but this >>> is enough to keep selinux from implementing a non-sense hard coded >>> policy that breaks other parts of the kernel. >> Specifics, please. Since I can't guess what problem you've >> encountered I can't tell if it's here, in the infrastructure, >> or in your perception of what constitutes "broken". >> >>> Signed-off-by: "Eric W. Biederman" >>> --- >>> ?security/selinux/hooks.c | 6 ++++++ >>> ?1 file changed, 6 insertions(+) >>> >>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c >>> index f5d304736852..edf4bd292dc7 100644 >>> --- a/security/selinux/hooks.c >>> +++ b/security/selinux/hooks.c >>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct >>> dentry *dentry, const char *name, >>> ? u32 newsid, sid = current_sid(); >>> ? int rc = 0; >>> ? >>> + if (strcmp(name, XATTR_NAME_CAPS) == 0) >>> + return cap_inode_setxattr(dentry, name, value, >>> size, flags); >>> + >> No. Don't even think of contemplating considering embedding the cap >> attribute check in the SELinux code. cap_inode_setxattr() is called >> in >> the infrastructure. > Except that it isn't, not if any other security module is enabled and > implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when > setting security.selinux or security.SMACK*. OK. Yes, this bit of the infrastructure is some of the worst I've done in a long time. This is a case where we already need special case stacking infrastructure. It looks like we'll have to separate setting the cap attribute from checking the cap state in order to make this work. In any case, the security_inode_setxattr() code is where the change belongs. There will likely be fallout changes in the modules, including the cap module. ? 
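The dispatch behaviour Stephen and Casey are discussing can be sketched as a small userspace C model. This is only an illustration under stated assumptions, not kernel code: every name prefixed with model_ or MODEL_ is hypothetical, the capability checks are reduced to two boolean flags, and the module's own policy decision is simply modelled as "allow". It tries to show why the framework-level capability check is not applied once a module supplies the setxattr hook, and why such a module then has to carry its own copy of that check.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: every model_/MODEL_ name is hypothetical. */
#define MODEL_SECURITY_PREFIX "security."
#define MODEL_NAME_CAPS       "security.capability"
#define MODEL_NAME_SELINUX    "security.selinux"
#define MODEL_NO_HOOK         1  /* sentinel: no module supplied the hook */

struct model_caller {
	int is_admin;           /* stands in for a CAP_SYS_ADMIN check */
	int may_set_file_caps;  /* assumed separate rule for the capability xattr */
};

typedef int (*model_setxattr_hook)(const char *name,
				   const struct model_caller *who);

/*
 * Capability-style check as described in the thread: the capability
 * attribute has its own rule, and any other security.* attribute
 * requires the admin privilege.
 */
int model_cap_setxattr(const char *name, const struct model_caller *who)
{
	size_t n = sizeof(MODEL_SECURITY_PREFIX) - 1;

	if (strcmp(name, MODEL_NAME_CAPS) == 0)
		return who->may_set_file_caps ? 0 : -EPERM;
	if (strncmp(name, MODEL_SECURITY_PREFIX, n) == 0 && !who->is_admin)
		return -EPERM;
	return 0;
}

/*
 * A module hook that owns security.selinux. Its own policy decision is
 * modelled as "allow"; for every other name it applies an inlined copy
 * of the capability logic (here, a direct call to the model above).
 */
int model_selinux_like_hook(const char *name, const struct model_caller *who)
{
	if (strcmp(name, MODEL_NAME_SELINUX) == 0)
		return 0;
	return model_cap_setxattr(name, who);
}

/*
 * Framework dispatch: if a module supplies the hook, its answer is final;
 * the capability check runs only when no module supplied one.
 */
int model_security_setxattr(model_setxattr_hook hook, const char *name,
			    const struct model_caller *who)
{
	int ret = hook ? hook(name, who) : MODEL_NO_HOOK;

	if (ret == MODEL_NO_HOOK)
		ret = model_cap_setxattr(name, who);
	return ret;
}
```

In this model, a caller without the admin flag is refused when setting security.selinux if no module hook is installed, but is allowed once the module hook is present, while an unrelated security.foo attribute is still refused by the module's copy of the check. If that copy drifts from the capability code, the two paths disagree, which appears to be the kind of divergence the original patch description complains about.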
> An alternative approach to fixing this would be to change the cap
> functions to only apply their checks if setting the capability
> attribute and defer any checks on other security.* attributes to either
> the security framework or the other security modules. Then the
> framework could always call all the modules on the inode_setxattr and
> inode_removexattr hooks as with other hooks. The security framework
> would then need to ensure that a check is still applied when setting
> security.* attributes if it isn't already handled by one of the enabled
> security modules, as you don't want unprivileged userspace to be able
> to set arbitrary security.foo attributes or to set up security.selinux
> or security.SMACK* attributes if those modules happen to be disabled.

Agreed. This isn't a two line change. Grumble.

I can guess at what the problem might be, but I hate making
assumptions when I go to fix a problem. I will start looking
at a patch, but it would really help if I could say for sure
what I'm out to accomplish. It may be obvious to the casual
observer, but that description has not been applied to me very
often.

>
>>
>>
>>>  	if (strcmp(name, XATTR_NAME_SELINUX))
>>>  		return selinux_inode_setotherxattr(dentry, name);
>>>
>>> @@ -3282,6 +3285,9 @@ static int selinux_inode_listxattr(struct
>>> dentry *dentry)
>>>
>>>  static int selinux_inode_removexattr(struct dentry *dentry, const
>>> char *name)
>>>  {
>>> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>>> +		return cap_inode_removexattr(dentry, name);
>>> +
>>>  	if (strcmp(name, XATTR_NAME_SELINUX))
>>>  		return selinux_inode_setotherxattr(dentry, name);
>>>
>>
>> .
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> .

From ebiederm at xmission.com Sat Sep 30 16:22:55 2017
From: ebiederm at xmission.com (Eric W.
Biederman) Date: Sat, 30 Sep 2017 11:22:55 -0500 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> (Casey Schaufler's message of "Fri, 29 Sep 2017 08:46:21 -0700") References: <87tvzmqwoi.fsf@xmission.com> <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> <1506694737.5571.9.camel@tycho.nsa.gov> <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> Message-ID: <87vak0ma00.fsf@xmission.com> Casey Schaufler writes: > On 9/29/2017 7:18 AM, Stephen Smalley wrote: >> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote: >>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote: >>>> It looks like once upon a time a long time ago selinux copied code >>>> from cap_inode_removexattr and cap_inode_setxattr into >>>> selinux_inode_setotherxattr.??However the code has now diverged and >>>> selinux is implementing a policy that is quite different than >>>> cap_inode_setxattr and cap_inode_removexattr especially when it >>>> comes >>>> to the security.capable xattr. >>> What leads you to believe that this isn't intentional? >>> It's most likely the case that this change occurred as >>> part of the first round module stacking change. What behavior >>> do you see that you're unhappy with? >>> >>>> To keep things working >>> Which "things"? How are they not "working"? >>> >>>> ?and to make the comments in security/security.c >>>> correct when the xattr is securit.capable, call cap_inode_setxattr >>>> or cap_inode_removexattr as appropriate. >>>> >>>> I suspect there is a larger conversation to be had here but this >>>> is enough to keep selinux from implementing a non-sense hard coded >>>> policy that breaks other parts of the kernel. >>> Specifics, please. Since I can't guess what problem you've >>> encountered I can't tell if it's here, in the infrastructure, >>> or in your perception of what constitutes "broken". >>> >>>> Signed-off-by: "Eric W. 
Biederman" >>>> --- >>>> ?security/selinux/hooks.c | 6 ++++++ >>>> ?1 file changed, 6 insertions(+) >>>> >>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c >>>> index f5d304736852..edf4bd292dc7 100644 >>>> --- a/security/selinux/hooks.c >>>> +++ b/security/selinux/hooks.c >>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct >>>> dentry *dentry, const char *name, >>>> ? u32 newsid, sid = current_sid(); >>>> ? int rc = 0; >>>> ? >>>> + if (strcmp(name, XATTR_NAME_CAPS) == 0) >>>> + return cap_inode_setxattr(dentry, name, value, >>>> size, flags); >>>> + >>> No. Don't even think of contemplating considering embedding the cap >>> attribute check in the SELinux code. cap_inode_setxattr() is called >>> in >>> the infrastructure. >> Except that it isn't, not if any other security module is enabled and >> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when >> setting security.selinux or security.SMACK*. > > OK. Yes, this bit of the infrastructure is some of the > worst I've done in a long time. This is a case where we > already need special case stacking infrastructure. It looks > like we'll have to separate setting the cap attribute from > checking the cap state in order to make this work. In any > case, the security_inode_setxattr() code is where the change > belongs. There will likely be fallout changes in the modules, > including the cap module. > ? > >> An alternative approach to fixing this would be to change the cap >> functions to only apply their checks if setting the capability >> attribute and defer any checks on other security.* attributes to either >> the security framework or the other security modules. Then the >> framework could always call all the modules on the inode_setxattr and >> inode_removexattr hooks as with other hooks. 
The security framework
>> would then need to ensure that a check is still applied when setting
>> security.* attributes if it isn't already handled by one of the enabled
>> security modules, as you don't want unprivileged userspace to be able
>> to set arbitrary security.foo attributes or to set up security.selinux
>> or security.SMACK* attributes if those modules happen to be disabled.
>
> Agreed. This isn't a two line change. Grumble.
>
> I can guess at what the problem might be, but I hate making
> assumptions when I go to fix a problem. I will start looking
> at a patch, but it would really help if I could say for sure
> what I'm out to accomplish. It may be obvious to the casual
> observer, but that description has not been applied to me very
> often.

Apologies for the delayed reply.

I am looking at security_inode_setxattr.

For setting attributes in the security.* namespace, the generic code in
fs/xattr.c applies no permission checks.

Each security module that implements an xattr in security.* then imposes
its own policy on its own attribute.

For smack the basic rule is smack_privileged(CAP_MAC_ADMIN).
For selinux the basic rule is inode_owner_or_capable(inode).
For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP).

commoncap also applies a default policy to setting the other security.*
xattrs: ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).

smack reuses that default policy by calling cap_inode_setxattr if the
attribute isn't a smack security.* xattr.

selinux has what looks like an old copy of the commoncap checks for
security.* in selinux_inode_setotherxattr: it tests
capable(CAP_SETFCAP) for security.capability and capable(CAP_SYS_ADMIN)
for the others.

There is the added complication that selinux also calls
selinux_inode_setotherxattr in the removexattr case. So fixing
this in selinux_inode_setotherxattr is not appropriate.

I believe selinux also has general policy hooks it applies to all
invocations of setxattr.

So I think to really fix this we need to separate the question "is this
your security module's attribute?" from the general policy checks added
by the security modules. Perhaps something like this for
security_inode_setxattr:

Hmm. Looking at it, at least ima also has the distinction between
protecting its own xattr writes and running general security module
policy on xattr writes.

int security_inode_setxattr(struct dentry *dentry, const char *name,
			    const void *value, size_t size, int flags)
{
	int ret = 0;

	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
		return 0;

	if (strncmp(name, XATTR_SECURITY_PREFIX,
		    sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) {
		/* Call the security modules and see if they all return
		 * -EOPNOTSUPP; if so, apply the default permission
		 * check of ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
		 * Otherwise, if one of the security modules supports
		 * this attribute (signaled by returning something other
		 * than -EOPNOTSUPP), set ret to that result.
		 *
		 * The security modules include at least smack, selinux,
		 * commoncap, ima, and evm.
		 */
		ret = magic_inode_protect_setxattr(dentry, name, value, size);
	}
	if (ret)
		return ret;

	/* Run all of the security module policy against this setxattr call */
	return magic_inode_policy_setxattr(dentry, name, value, size);
}

Eric

From ebiederm at xmission.com Sat Sep 30 20:40:43 2017
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 30 Sep 2017 15:40:43 -0500
Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave
In-Reply-To: (Casey Schaufler's message of "Sat, 30 Sep 2017 10:01:48 -0700")
References: <87tvzmqwoi.fsf@xmission.com> <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> <1506694737.5571.9.camel@tycho.nsa.gov> <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> <87vak0ma00.fsf@xmission.com>
Message-ID: <87d167ncms.fsf@xmission.com>

Casey Schaufler writes:

> On 9/30/2017 9:22 AM, Eric W.
Biederman wrote: >> Casey Schaufler writes: >> >>> On 9/29/2017 7:18 AM, Stephen Smalley wrote: >>>> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote: >>>>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote: >>>>>> It looks like once upon a time a long time ago selinux copied code >>>>>> from cap_inode_removexattr and cap_inode_setxattr into >>>>>> selinux_inode_setotherxattr.??However the code has now diverged and >>>>>> selinux is implementing a policy that is quite different than >>>>>> cap_inode_setxattr and cap_inode_removexattr especially when it >>>>>> comes >>>>>> to the security.capable xattr. >>>>> What leads you to believe that this isn't intentional? >>>>> It's most likely the case that this change occurred as >>>>> part of the first round module stacking change. What behavior >>>>> do you see that you're unhappy with? >>>>> >>>>>> To keep things working >>>>> Which "things"? How are they not "working"? >>>>> >>>>>> ?and to make the comments in security/security.c >>>>>> correct when the xattr is securit.capable, call cap_inode_setxattr >>>>>> or cap_inode_removexattr as appropriate. >>>>>> >>>>>> I suspect there is a larger conversation to be had here but this >>>>>> is enough to keep selinux from implementing a non-sense hard coded >>>>>> policy that breaks other parts of the kernel. >>>>> Specifics, please. Since I can't guess what problem you've >>>>> encountered I can't tell if it's here, in the infrastructure, >>>>> or in your perception of what constitutes "broken". >>>>> >>>>>> Signed-off-by: "Eric W. Biederman" >>>>>> --- >>>>>> ?security/selinux/hooks.c | 6 ++++++ >>>>>> ?1 file changed, 6 insertions(+) >>>>>> >>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c >>>>>> index f5d304736852..edf4bd292dc7 100644 >>>>>> --- a/security/selinux/hooks.c >>>>>> +++ b/security/selinux/hooks.c >>>>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct >>>>>> dentry *dentry, const char *name, >>>>>> ? 
u32 newsid, sid = current_sid(); >>>>>> ? int rc = 0; >>>>>> ? >>>>>> + if (strcmp(name, XATTR_NAME_CAPS) == 0) >>>>>> + return cap_inode_setxattr(dentry, name, value, >>>>>> size, flags); >>>>>> + >>>>> No. Don't even think of contemplating considering embedding the cap >>>>> attribute check in the SELinux code. cap_inode_setxattr() is called >>>>> in >>>>> the infrastructure. >>>> Except that it isn't, not if any other security module is enabled and >>>> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when >>>> setting security.selinux or security.SMACK*. >>> OK. Yes, this bit of the infrastructure is some of the >>> worst I've done in a long time. This is a case where we >>> already need special case stacking infrastructure. It looks >>> like we'll have to separate setting the cap attribute from >>> checking the cap state in order to make this work. In any >>> case, the security_inode_setxattr() code is where the change >>> belongs. There will likely be fallout changes in the modules, >>> including the cap module. >>> ? >>> >>>> An alternative approach to fixing this would be to change the cap >>>> functions to only apply their checks if setting the capability >>>> attribute and defer any checks on other security.* attributes to either >>>> the security framework or the other security modules. Then the >>>> framework could always call all the modules on the inode_setxattr and >>>> inode_removexattr hooks as with other hooks. The security framework >>>> would then need to ensure that a check is still applied when setting >>>> security.* attributes if it isn't already handled by one of the enabled >>>> security modules, as you don't want unprivileged userspace to be able >>>> to set arbitrary security.foo attributes or to set up security.selinux >>>> or security.SMACK* attributes if those modules happen to be disabled. >>> Agreed. This isn't a two line change. Grumble. 
>>> >>> I can guess at what the problem might be, but I hate making >>> assumptions when I go to fix a problem. I will start looking >>> at a patch, but it would really help if I could say for sure >>> what I'm out to accomplish. It may be obvious to the casual >>> observer, but that description has not been applied to me very >>> often. >> Apologies for the delayed reply. >> >> I am looking at security_inode_setxattr. >> >> For setting attributes in the security.* the generic code in fs/xattr.c >> applies no permission checks. >> >> Each security module that implements an xattr in security.* then imposes >> it's own policy on it's own attribute. >> >> For smack the basic rule is smack_privileged(CAP_MAC_ADMIN). >> For selinux the basic rule is inode_or_owner_capable(inode). >> For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP). >> >> commoncap also applies a default policity to setting security.* xattrs. >> ns_capable(dentry->d_sb->s_userns, CAP_SYS_ADMIN). >> >> smack reuses that default policy by calling cap_inode_setxattr if it >> isn't a smack security.* xattr. >> >> selinux has what looks like an old copy of the commoncap checks for >> the security.* in selinux_inode_setotherxattr. Testing for >> capable(CAP_SETFCAP) for security.capable and capable(CAP_SYS_ADMIN) >> for the others. >> >> With the added complication that selinux calls >> selinux_inode_setotherxattr also for the remove_xattr case. So fixing >> this in selinux_inode_setotherxattr is not appropriate. >> >> I believe selinux also has general policy hooks it applies to all >> invocations of setxattr. >> >> So I think to really fix this we need to separate the cases of is this >> your security modules attribute from general policy checks added by the >> security modules. Perhaps something like this for >> security_inode_setxattr: >> >> Hmm. 
Looking at it, at least ima also has the distinction between protecting
>> its own xattr writes and running general security module policy on
>> xattr writes.
>>
>> int security_inode_setxattr(struct dentry *dentry, const char *name,
>> 			    const void *value, size_t size, int flags)
>> {
>> 	int ret = 0;
>>
>> 	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
>> 		return 0;
>>
>> 	if (strncmp(name, XATTR_SECURITY_PREFIX,
>> 		    sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) {
>> 		/* Call the security modules and see if they all return
>> 		 * -EOPNOTSUPP; if so, apply the default permission
>> 		 * check of ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
>> 		 * Otherwise, if one of the security modules supports
>> 		 * this attribute (signaled by returning something other
>> 		 * than -EOPNOTSUPP), set ret to that result.
>> 		 *
>> 		 * The security modules include at least smack, selinux,
>> 		 * commoncap, ima, and evm.
>> 		 */
>> 		ret = magic_inode_protect_setxattr(dentry, name, value, size);
>> 	}
>> 	if (ret)
>> 		return ret;
>>
>> 	/* Run all of the security module policy against this setxattr call */
>> 	return magic_inode_policy_setxattr(dentry, name, value, size);
>> }
>>
>> Eric
>
> Yup, that's pretty much what I'm thinking. It's unfortunate
> that the magic_ API isn't fully implemented. There's going to
> be a good deal of code surgery instead. Is there an observed
> problem today? This is going to have to get addressed for stacking,
> so if there isn't a behavioral issue that impacts something real
> I would like to defer spending significant time on it. Do you have
> a case where this is not working correctly?

Merged as of 4.14-rc1 is the support for user namespace root to set
security.capability. This fails when selinux is loaded.

removexattr has the same problem and the code is a little less
convoluted in that case.

Not being able to set the capability when you should be able to is
very noticeable. Like running-into-a-brick-wall noticeable.

Which is where the minimal patch for selinux comes in.
I think it solves the exact case in question, even if it isn't the perfect long term solution. Eric From casey at schaufler-ca.com Sat Sep 30 23:22:12 2017 From: casey at schaufler-ca.com (Casey Schaufler) Date: Sat, 30 Sep 2017 16:22:12 -0700 Subject: [RFC][PATCH] security: Make the selinux setxattr and removexattr hooks behave In-Reply-To: <87d167ncms.fsf@xmission.com> References: <87tvzmqwoi.fsf@xmission.com> <1913d5c4-64ef-36c1-e8ad-c779ff5c7995@schaufler-ca.com> <1506694737.5571.9.camel@tycho.nsa.gov> <6f293107-6ff9-c4c7-f682-207a546c5061@schaufler-ca.com> <87vak0ma00.fsf@xmission.com> <87d167ncms.fsf@xmission.com> Message-ID: On 9/30/2017 1:40 PM, Eric W. Biederman wrote: > Casey Schaufler writes: > >> On 9/30/2017 9:22 AM, Eric W. Biederman wrote: >>> Casey Schaufler writes: >>> >>>> On 9/29/2017 7:18 AM, Stephen Smalley wrote: >>>>> On Thu, 2017-09-28 at 18:16 -0700, Casey Schaufler wrote: >>>>>> On 9/28/2017 3:34 PM, Eric W. Biederman wrote: >>>>>>> It looks like once upon a time a long time ago selinux copied code >>>>>>> from cap_inode_removexattr and cap_inode_setxattr into >>>>>>> selinux_inode_setotherxattr.??However the code has now diverged and >>>>>>> selinux is implementing a policy that is quite different than >>>>>>> cap_inode_setxattr and cap_inode_removexattr especially when it >>>>>>> comes >>>>>>> to the security.capable xattr. >>>>>> What leads you to believe that this isn't intentional? >>>>>> It's most likely the case that this change occurred as >>>>>> part of the first round module stacking change. What behavior >>>>>> do you see that you're unhappy with? >>>>>> >>>>>>> To keep things working >>>>>> Which "things"? How are they not "working"? >>>>>> >>>>>>> ?and to make the comments in security/security.c >>>>>>> correct when the xattr is securit.capable, call cap_inode_setxattr >>>>>>> or cap_inode_removexattr as appropriate. 
>>>>>>> >>>>>>> I suspect there is a larger conversation to be had here but this >>>>>>> is enough to keep selinux from implementing a non-sense hard coded >>>>>>> policy that breaks other parts of the kernel. >>>>>> Specifics, please. Since I can't guess what problem you've >>>>>> encountered I can't tell if it's here, in the infrastructure, >>>>>> or in your perception of what constitutes "broken". >>>>>> >>>>>>> Signed-off-by: "Eric W. Biederman" >>>>>>> --- >>>>>>> ?security/selinux/hooks.c | 6 ++++++ >>>>>>> ?1 file changed, 6 insertions(+) >>>>>>> >>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c >>>>>>> index f5d304736852..edf4bd292dc7 100644 >>>>>>> --- a/security/selinux/hooks.c >>>>>>> +++ b/security/selinux/hooks.c >>>>>>> @@ -3167,6 +3167,9 @@ static int selinux_inode_setxattr(struct >>>>>>> dentry *dentry, const char *name, >>>>>>> ? u32 newsid, sid = current_sid(); >>>>>>> ? int rc = 0; >>>>>>> ? >>>>>>> + if (strcmp(name, XATTR_NAME_CAPS) == 0) >>>>>>> + return cap_inode_setxattr(dentry, name, value, >>>>>>> size, flags); >>>>>>> + >>>>>> No. Don't even think of contemplating considering embedding the cap >>>>>> attribute check in the SELinux code. cap_inode_setxattr() is called >>>>>> in >>>>>> the infrastructure. >>>>> Except that it isn't, not if any other security module is enabled and >>>>> implements those hooks, to prevent imposing CAP_SYS_ADMIN checks when >>>>> setting security.selinux or security.SMACK*. >>>> OK. Yes, this bit of the infrastructure is some of the >>>> worst I've done in a long time. This is a case where we >>>> already need special case stacking infrastructure. It looks >>>> like we'll have to separate setting the cap attribute from >>>> checking the cap state in order to make this work. In any >>>> case, the security_inode_setxattr() code is where the change >>>> belongs. There will likely be fallout changes in the modules, >>>> including the cap module. >>>> ? 
>>>> >>>>> An alternative approach to fixing this would be to change the cap >>>>> functions to only apply their checks if setting the capability >>>>> attribute and defer any checks on other security.* attributes to either >>>>> the security framework or the other security modules. Then the >>>>> framework could always call all the modules on the inode_setxattr and >>>>> inode_removexattr hooks as with other hooks. The security framework >>>>> would then need to ensure that a check is still applied when setting >>>>> security.* attributes if it isn't already handled by one of the enabled >>>>> security modules, as you don't want unprivileged userspace to be able >>>>> to set arbitrary security.foo attributes or to set up security.selinux >>>>> or security.SMACK* attributes if those modules happen to be disabled. >>>> Agreed. This isn't a two line change. Grumble. >>>> >>>> I can guess at what the problem might be, but I hate making >>>> assumptions when I go to fix a problem. I will start looking >>>> at a patch, but it would really help if I could say for sure >>>> what I'm out to accomplish. It may be obvious to the casual >>>> observer, but that description has not been applied to me very >>>> often. >>> Apologies for the delayed reply. >>> >>> I am looking at security_inode_setxattr. >>> >>> For setting attributes in the security.* the generic code in fs/xattr.c >>> applies no permission checks. >>> >>> Each security module that implements an xattr in security.* then imposes >>> it's own policy on it's own attribute. >>> >>> For smack the basic rule is smack_privileged(CAP_MAC_ADMIN). >>> For selinux the basic rule is inode_or_owner_capable(inode). >>> For commoncap the basic rule is capable_wrt_inode_uidgid(inode, CAP_SETFCAP). >>> >>> commoncap also applies a default policity to setting security.* xattrs. >>> ns_capable(dentry->d_sb->s_userns, CAP_SYS_ADMIN). 
>>> >>> smack reuses that default policy by calling cap_inode_setxattr if it >>> isn't a smack security.* xattr. >>> >>> selinux has what looks like an old copy of the commoncap checks for >>> the security.* in selinux_inode_setotherxattr. Testing for >>> capable(CAP_SETFCAP) for security.capable and capable(CAP_SYS_ADMIN) >>> for the others. >>> >>> With the added complication that selinux calls >>> selinux_inode_setotherxattr also for the remove_xattr case. So fixing >>> this in selinux_inode_setotherxattr is not appropriate. >>> >>> I believe selinux also has general policy hooks it applies to all >>> invocations of setxattr. >>> >>> So I think to really fix this we need to separate the cases of is this >>> your security modules attribute from general policy checks added by the >>> security modules. Perhaps something like this for >>> security_inode_setxattr: >>> >>> Hmm. Looking at least ima also has the distinction between protecting >>> it's own xattr writes and running generaly security module policy on >>> xattr writes. >>> >>> int security_inode_setxattr(struct dentry *dentry, const char *name, >>> const void *value, size_t size, int flags) >>> { >>> int ret = 0; >>> >>> if (unlikely(IS_PRIVATE(d_backing_inode(dentry)))) >>> return 0; >>> >>> if (strncmp(name, XATTR_SECURITY_PREFIX, >>> sizeof(XATTR_SECURITY_PREFIX) - 1) == 0) { >>> /* Call the security modules and see if they all return >>> * -EOPNOTSUPP if so apply the default permission >>> * check of ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN) >>> * otherwise if one of the security modules supports >>> * this attribute (signaled by returning something other >>> * -EOPNOTSUPP) then set ret to that result. >>> * >>> * The security modules include at least smack, selinux, >>> * commoncap, ima, and evm. 
>>> 		 */
>>> 		ret = magic_inode_protect_setxattr(dentry, name, value, size);
>>> 	}
>>> 	if (ret)
>>> 		return ret;
>>>
>>> 	/* Run all of the security module policy against this setxattr call */
>>> 	return magic_inode_policy_setxattr(dentry, name, value, size);
>>> }
>>>
>>> Eric
>> Yup, that's pretty much what I'm thinking. It's unfortunate
>> that the magic_ API isn't fully implemented. There's going to
>> be a good deal of code surgery instead. Is there an observed
>> problem today? This is going to have to get addressed for stacking,
>> so if there isn't a behavioral issue that impacts something real
>> I would like to defer spending significant time on it. Do you have
>> a case where this is not working correctly?
> Merged as of 4.14-rc1 is the support for user namespace root to set
> security.capability. This fails when selinux is loaded.

OK. Is the failure unique to SELinux, or does it fail with Smack
as well?

> removexattr has the same problem and the code is a little less
> convoluted in that case.

Right. Because removexattr is a simpler situation.

> Not being able to set the capability when you should be able to is
> very noticeable. Like running-into-a-brick-wall noticeable.

Ah, now you've identified the problem. Yes, I would agree that
you've uncovered an undesirable behavior.

> Which is where the minimal patch for selinux comes in. I think it
> solves the exact case in question, even if it isn't the perfect long
> term solution.

If the problem is unique to SELinux I can see your logic. If
it isn't, that is, if it's also a problem with any other security
module, there needs to be either a fix for those modules as well
or a "real" fix. I'm not opposed to the SELinux short-term fix if
you can say that SELinux is the only module with the problem.
>
> Eric
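
The "protect" half of the security_inode_setxattr() proposal quoted
above can be modelled in user space. What follows is only a sketch
under stated assumptions, not kernel code: struct caller,
caps_protect(), selinux_protect() and model_protect_setxattr() are
hypothetical names invented for illustration, and the three booleans
merely stand in for the CAP_SETFCAP, inode_owner_or_capable() and
CAP_SYS_ADMIN checks discussed in the thread. A module signals "not my
attribute" with -EOPNOTSUPP, and the default check applies only when no
module claims the name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SECURITY_PREFIX "security."

/* Hypothetical stand-in for the caller's privileges (not a kernel type). */
struct caller {
	bool has_setfcap;   /* models CAP_SETFCAP with respect to the inode */
	bool is_owner;      /* models inode_owner_or_capable() succeeding */
	bool has_sys_admin; /* models CAP_SYS_ADMIN in the superblock's user ns */
};

/* A module returns -EOPNOTSUPP for attributes it does not own. */
typedef int (*protect_hook)(const char *name, const struct caller *c);

/* Models commoncap guarding only its own attribute. */
static int caps_protect(const char *name, const struct caller *c)
{
	if (strcmp(name, "security.capability") != 0)
		return -EOPNOTSUPP;
	return c->has_setfcap ? 0 : -EPERM;
}

/* Models selinux guarding only its own attribute. */
static int selinux_protect(const char *name, const struct caller *c)
{
	if (strcmp(name, "security.selinux") != 0)
		return -EOPNOTSUPP;
	return c->is_owner ? 0 : -EPERM;
}

static const protect_hook hooks[] = { caps_protect, selinux_protect };

/*
 * Ask each module whether it owns the attribute. The first module that
 * answers with anything other than -EOPNOTSUPP decides the outcome.
 * If nobody claims a security.* name, fall back to the default check.
 */
int model_protect_setxattr(const char *name, const struct caller *c)
{
	size_t i;

	assert(name != NULL && c != NULL);

	/* Names outside security.* are not subject to this phase. */
	if (strncmp(name, SECURITY_PREFIX, sizeof(SECURITY_PREFIX) - 1) != 0)
		return 0;

	for (i = 0; i < sizeof(hooks) / sizeof(hooks[0]); i++) {
		int rc = hooks[i](name, c);

		if (rc != -EOPNOTSUPP)
			return rc;
	}

	/* Unclaimed security.* attribute: default privilege check. */
	return c->has_sys_admin ? 0 : -EPERM;
}
```

In this model a caller holding only the CAP_SETFCAP stand-in can set
security.capability but is refused for an unclaimed name such as
security.foo, which is the separation the thread is asking for. The
second phase (general module policy run on every setxattr call) is left
out of the sketch.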