From christian.brauner at canonical.com Thu Feb 1 08:57:44 2018 From: christian.brauner at canonical.com (Christian Brauner) Date: Thu, 1 Feb 2018 09:57:44 +0100 Subject: rtnetlink updates in 4.16 Message-ID: <20180201085743.eqxbspsy6yg6n6yl@gmail.com> Hi, Pushed some patches that might be interesting to container runtimes. The gist is: rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,NEW,SET}LINK The series enables passing an IFLA_IF_NETNSID property along with RTM_{DEL,NEW,SET}LINK requests to perform operations on network namespaces without the need to setns() to the corresponding network namespace (and its owning user namespace). For workloads employing user namespaces and network namespaces this effectively lets you avoid spawning an additional helper process. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c4f63ba824302492985553018881455982241d6 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c310bfcb6e1be993629c5747accf8e1c65fbb255 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b61ad68a9fe85d29d5363eb36860164a049723cf https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5bb8ed075428b71492734af66230aa0c07fcc515 Thanks! Christian From paul at paul-moore.com Fri Feb 2 21:18:41 2018 From: paul at paul-moore.com (Paul Moore) Date: Fri, 2 Feb 2018 16:18:41 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: <1515514736.3239.10.camel@redhat.com> References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> Message-ID: On Tue, Jan 9, 2018 at 11:18 AM, Simo Sorce wrote: > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: ... >> Changelog: >> >> (Upstream V3) >> - switch back to u64 (from pmoore, can be expanded to u128 in future if >> need arises without breaking API. 
u32 was originally proposed, up to >> c36 discussed) >> - write-once, but children inherit audit container identifier and can >> then still be written once >> - switch to CAP_AUDIT_CONTROL >> - group namespace actions together, auxiliary records to namespace >> operations. >> >> (Upstream V2) >> - switch from u64 to u128 UUID >> - switch from "signal" and "trigger" to "register" >> - restrict registration to single process or force all threads and >> children into same container > > I am trying to understand the back and forth on the ID size. I'm just now getting a chance to read Richard's latest draft, but I wanted to comment on this quickly. There are two main reasons for keeping this a 32 or 64 bit integer: 1) After the initial "be able to associate audit events with a container" stage, we are going to look into supporting multiple audit daemons on the system so that you could run an audit daemon inside a container and it would collect events generated by the container (we're tentatively calling this "phase 2", feel free to insert your own "magic happens" joke). There are a lot of things that need to happen in phase two; one of them is the addition of an audit event routing mechanism that will send audit records to the right audit daemons (the "host" daemon will always see everything). In order to do this we will need to be able to quickly compare audit container IDs, which means an integer. 2) Whatever we pick for an audit container ID it is going to be wrong for at least one container orchestrator. There is no "one" solution here, so we are providing a small and flexible mechanism that higher level orchestrators can use to provide a more complete solution. > From an orchestrator POV anything that requires tracking a node > specific ID is not ideal. > > Orchestrators tend to span many nodes, and containers tend to have IDs > that are either UUID or have a Hash (like SHA256) as identifier. You're helping me prove my reason #2. 
> The problem here is two-fold: > > a) Your auditing requires some mapping to be useful outside of the > system. > If you aggregate audit logs outside of the system or you want to > correlate the system audit logs with other components dealing with > containers, now you need a place where you provide a mapping from your > audit u64 to the ID a container has in the rest of the system. Yep, see my reason #2. I want us to have something that "works" for a single system as well as something that can be leveraged by higher level tools for large networks of machines. I realize it's easy, and tempting, to expand the scope of this effort; but if we are to have any success it is only going to be through some discipline. We need to focus on a small solution which addresses the basic needs and hopefully remains flexible enough for any potential expansion while staying palatable to the audit folks and the general kernel community. > b) Now you need a mapping of some sort. The simplest way a container > orchestrator can go about this is to just use the UUID or Hash > representing their view of the container, truncate it to a u64 and use > that for Audit. This means there is some chance there will be a > collision and a duplicate u64 ID will be used by the orchestrator as > the container ID. What happens in that case? That is a design decision left to the different container orchestrators. 
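The mapping concern discussed above (truncating an orchestrator's UUID or hash to a u64, then dealing with a possible collision) can be sketched roughly as follows. This is an illustration only and not part of the proposal: the names `truncate_to_u64`, `collision_probability` and `build_mapping` are hypothetical, and both the choice to keep the leading 8 bytes and the choice to reject a colliding ID are assumptions, since the thread leaves that decision to each orchestrator.

```python
import math


def truncate_to_u64(hex_id: str) -> int:
    """Keep the leading 64 bits of a hex-encoded UUID or hash (assumed policy)."""
    raw = bytes.fromhex(hex_id.replace("-", ""))
    if len(raw) < 8:
        raise ValueError("identifier is shorter than 64 bits")
    return int.from_bytes(raw[:8], "big")


def collision_probability(n_ids: int, bits: int = 64) -> float:
    """Birthday-bound estimate that n uniformly random IDs contain a duplicate."""
    # 1 - exp(-n(n-1) / 2^(bits+1)), computed with expm1 for small values.
    return -math.expm1(-n_ids * (n_ids - 1) / 2.0 ** (bits + 1))


def build_mapping(container_ids):
    """Map audit u64 -> orchestrator ID, refusing to reuse a u64 (assumed policy)."""
    mapping = {}
    for cid in container_ids:
        audit_id = truncate_to_u64(cid)
        owner = mapping.setdefault(audit_id, cid)
        if owner != cid:
            raise ValueError(f"u64 collision between {owner} and {cid}")
    return mapping


if __name__ == "__main__":
    ids = ["01234567-89ab-cdef-0000-000000000000", "ff" * 32]
    for audit_id, cid in build_mapping(ids).items():
        print(f"{audit_id:#018x} -> {cid}")
    print(collision_probability(10**6))
```

Under the uniform-randomness assumption, the estimate for one million IDs on a single 64-bit space is on the order of 1e-8, so a collision is unlikely but the orchestrator still has to decide what to do when one occurs.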
-- paul moore www.paul-moore.com From paul at paul-moore.com Fri Feb 2 21:24:26 2018 From: paul at paul-moore.com (Paul Moore) Date: Fri, 2 Feb 2018 16:24:26 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> Message-ID: On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote: > On 2018-01-09 11:18, Simo Sorce wrote: >> On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: >> > Containers are a userspace concept. The kernel knows nothing of them. >> > >> > The Linux audit system needs a way to be able to track the container >> > provenance of events and actions. Audit needs the kernel's help to do >> > this. >> > >> > Since the concept of a container is entirely a userspace concept, a >> > registration from the userspace container orchestration system initiates >> > this. This will define a point in time and a set of resources >> > associated with a particular container with an audit container >> > identifier. >> > >> > The registration is a u64 representing the audit container identifier >> > written to a special file in a pseudo filesystem (proc, since PID tree >> > already exists) representing a process that will become a parent process >> > in that container. This write might place restrictions on mount >> > namespaces required to define a container, or at least careful checking >> > of namespaces in the kernel to verify permissions of the orchestrator so >> > it can't change its own container ID. A bind mount of nsfs may be >> > necessary in the container orchestrator's mount namespace. This write >> > can only happen once per process. 
>> > >> > Note: The justification for using a u64 is that it minimizes the >> > information printed in every audit record, reducing bandwidth and limits >> > comparisons to a single u64 which will be faster and less error-prone. >> > >> > Require CAP_AUDIT_CONTROL to be able to carry out the registration. At >> > that time, record the target container's user-supplied audit container >> > identifier along with a target container's parent process (which may >> > become the target container's "init" process) process ID (referenced >> > from the initial PID namespace) in a new record AUDIT_CONTAINER with a >> > qualifying op=$action field. >> > >> > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid >> > container ID present on an auditable action or event. >> > >> > Forked and cloned processes inherit their parent's audit container >> > identifier, referenced in the process' task_struct. Since the audit >> > container identifier is inherited rather than written, it can still be >> > written once. This will prevent tampering while allowing nesting. >> > (This can be implemented with an internal settable flag upon >> > registration that does not get copied across a fork/clone.) >> > >> > Mimic setns(2) and return an error if the process has already initiated >> > threading or forked since this registration should happen before the >> > process execution is started by the orchestrator and hence should not >> > yet have any threads or children. If this is deemed overly restrictive, >> > switch all of the target's threads and children to the new containerID. >> > >> > Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL. >> > >> > When a container ceases to exist because the last process in that >> > container has exited log the fact to balance the registration action. >> > (This is likely needed for certification accountability.) 
>> > >> > At this point it appears unnecessary to add a container session >> > identifier since this is all tracked from loginuid and sessionid to >> > communicate with the container orchestrator to spawn an additional >> > session into an existing container which would be logged. It can be >> > added at a later date without breaking API should it be deemed >> > necessary. >> > >> > The following namespace logging actions are not needed for certification >> > purposes at this point, but are helpful for tracking namespace activity. >> > These are auxilliary records that are associated with namespace >> > manipulation syscalls unshare(2), clone(2) and setns(2), so the records >> > will only show up if explicit syscall rules have been added to document >> > this activity. >> > >> > Log the creation of every namespace, inheriting/adding its spawning >> > process' audit container identifier(s), if applicable. Include the >> > spawning and spawned namespace IDs (device and inode number tuples). >> > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] >> > Note: At this point it appears only network namespaces may need to track >> > container IDs apart from processes since incoming packets may cause an >> > auditable event before being associated with a process. Since a >> > namespace can be shared by processes in different containers, the >> > namespace will need to track all containers to which it has been >> > assigned. >> > >> > Upon registration, the target process' namespace IDs (in the form of a >> > nsfs device number and inode number tuple) will be recorded in an >> > AUDIT_NS_INFO auxilliary record. >> > >> > Log the destruction of every namespace that is no longer used by any >> > process, including the namespace IDs (device and inode number tuples). 
>> > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] >> > >> > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) >> > the parent and child namespace IDs for any changes to a process' >> > namespaces. [setns(2)] >> > Note: It may be possible to combine AUDIT_NS_* record formats and >> > distinguish them with an op=$action field depending on the fields >> > required for each message type. >> > >> > The audit container identifier will need to be reaped from all >> > implicated namespaces upon the destruction of a container. >> > >> > This namespace information adds supporting information for tracking >> > events not attributable to specific processes. >> > >> > Changelog: >> > >> > (Upstream V3) >> > - switch back to u64 (from pmoore, can be expanded to u128 in future if >> > need arises without breaking API. u32 was originally proposed, up to >> > c36 discussed) >> > - write-once, but children inherit audit container identifier and can >> > then still be written once >> > - switch to CAP_AUDIT_CONTROL >> > - group namespace actions together, auxilliary records to namespace >> > operations. >> > >> > (Upstream V2) >> > - switch from u64 to u128 UUID >> > - switch from "signal" and "trigger" to "register" >> > - restrict registration to single process or force all threads and >> > children into same container >> >> I am trying to understand the back and forth on the ID size. >> >> From an orchestrator POV anything that requires tracking a node >> specific ID is not ideal. >> >> Orchestrators tend to span many nodes, and containers tend to have IDs >> that are either UUID or have a Hash (like SHA256) as identifier. >> >> The problem here is two-fold: >> >> a) Your auditing requires some mapping to be useful outside of the >> system. 
>> If you aggreggate audit logs outside of the system or you want to >> correlate the system audit logs with other components dealing with >> containers, now you need a place where you provide a mapping from your >> audit u64 to the ID a container has in the rest of the system. >> >> b) Now you need a mapping of some sort. The simplest way a container >> orchestrator can go about this is to just use the UUID or Hash >> representing their view of the container, truncate it to a u64 and use >> that for Audit. This means there are some chances there will be a >> collision and a duplicate u64 ID will be used by the orchestrator as >> the container ID. What happen in that case ? > > Paul, can you justify this somewhat larger inconvenience for some > relatively minor convenience on our part? Done in direct response to Simo. But to be clear Richard, we've talked about this a few times, it's not a "minor convenience" on our part, it's a pretty big convenience once we starting having to route audit events and make decisions based on the audit container ID information. Audit performance is less than awesome now, I'm working hard to not make it worse. > u64 vs u128 is easy for us to > accomodate in terms of scalar comparisons. It doubles the information > in every container id field we print in audit records. ... and slows down audit container ID checks. > A c36 is a bigger step. Yeah, we're not doing that, no way. -- paul moore www.paul-moore.com From paul at paul-moore.com Fri Feb 2 22:05:22 2018 From: paul at paul-moore.com (Paul Moore) Date: Fri, 2 Feb 2018 17:05:22 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> Message-ID: On Tue, Jan 9, 2018 at 7:16 AM, Richard Guy Briggs wrote: > Containers are a userspace concept. The kernel knows nothing of them. 
> > The Linux audit system needs a way to be able to track the container > provenance of events and actions. Audit needs the kernel's help to do > this. Two small comments below, but I tend to think we are at a point where you can start cobbling together some prototype/RFC patches. Surely there are going to be a few changes, and new comments, that come out once we see an initial implementation so let's see what those are. > The registration is a u64 representing the audit container identifier > written to a special file in a pseudo filesystem (proc, since PID tree > already exists) representing a process that will become a parent process > in that container. This write might place restrictions on mount > namespaces required to define a container, or at least careful checking > of namespaces in the kernel to verify permissions of the orchestrator so > it can't change its own container ID. A bind mount of nsfs may be > necessary in the container orchestrator's mount namespace. This write > can only happen once per process. > > Note: The justification for using a u64 is that it minimizes the > information printed in every audit record, reducing bandwidth and limits > comparisons to a single u64 which will be faster and less error-prone. I know Steve generally worries about audit record size, which is a perfectly valid concern in this case, I also worry about the additional overhead when we start routing audit records to multiple audit daemons (see my other emails in this thread). > ... > When a container ceases to exist because the last process in that > container has exited log the fact to balance the registration action. > (This is likely needed for certification accountability.) On the "container ceases to exist" point, I expect this "container dead" message to come from the orchestrator and not the kernel itself (I don't want the kernel to have to handle that level of bookkeeping). I imagine this should be similar to what is done for VM auditing with libvirt. 
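The registration step quoted earlier in this message (a u64 written once to a per-process file in proc) might look like this from the orchestrator side. This is a minimal sketch under stated assumptions: the thread does not name the file or define its format, so the path is passed in by the caller, the decimal text encoding is a guess, and `register_audit_container_id` is a hypothetical helper. The write-once rule and the CAP_AUDIT_CONTROL check would be enforced by the kernel, not by this code.

```python
import os
import tempfile

U64_MAX = (1 << 64) - 1


def register_audit_container_id(proc_file: str, audit_id: int) -> None:
    """Write an audit container ID to the given registration file.

    proc_file is an assumption: the proposal only says "a special file in
    a pseudo filesystem (proc)" for the target process.
    """
    if not 0 <= audit_id <= U64_MAX:
        raise ValueError("audit container ID must fit in a u64")
    with open(proc_file, "w") as f:
        # Decimal text is an assumed encoding, not one taken from the RFC.
        f.write(str(audit_id))


if __name__ == "__main__":
    # A temporary file stands in for the (unnamed) proc entry.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        register_audit_container_id(path, 123456789)
        with open(path) as f:
            print(f.read())
    finally:
        os.remove(path)
```

Range-checking in userspace only catches obvious mistakes early; an orchestrator would still need to handle an error from the kernel, for example when the target process has already been registered.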
-- paul moore www.paul-moore.com From simo at redhat.com Fri Feb 2 22:19:06 2018 From: simo at redhat.com (Simo Sorce) Date: Fri, 02 Feb 2018 17:19:06 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> Message-ID: <1517609946.13097.161.camel@redhat.com> On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote: > On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote: > > On 2018-01-09 11:18, Simo Sorce wrote: > > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: > > > > Containers are a userspace concept. The kernel knows nothing of them. > > > > > > > > The Linux audit system needs a way to be able to track the container > > > > provenance of events and actions. Audit needs the kernel's help to do > > > > this. > > > > > > > > Since the concept of a container is entirely a userspace concept, a > > > > registration from the userspace container orchestration system initiates > > > > this. This will define a point in time and a set of resources > > > > associated with a particular container with an audit container > > > > identifier. > > > > > > > > The registration is a u64 representing the audit container identifier > > > > written to a special file in a pseudo filesystem (proc, since PID tree > > > > already exists) representing a process that will become a parent process > > > > in that container. This write might place restrictions on mount > > > > namespaces required to define a container, or at least careful checking > > > > of namespaces in the kernel to verify permissions of the orchestrator so > > > > it can't change its own container ID. A bind mount of nsfs may be > > > > necessary in the container orchestrator's mount namespace. This write > > > > can only happen once per process. 
> > > > > > > > Note: The justification for using a u64 is that it minimizes the > > > > information printed in every audit record, reducing bandwidth and limits > > > > comparisons to a single u64 which will be faster and less error-prone. > > > > > > > > Require CAP_AUDIT_CONTROL to be able to carry out the registration. At > > > > that time, record the target container's user-supplied audit container > > > > identifier along with a target container's parent process (which may > > > > become the target container's "init" process) process ID (referenced > > > > from the initial PID namespace) in a new record AUDIT_CONTAINER with a > > > > qualifying op=$action field. > > > > > > > > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > > > > container ID present on an auditable action or event. > > > > > > > > Forked and cloned processes inherit their parent's audit container > > > > identifier, referenced in the process' task_struct. Since the audit > > > > container identifier is inherited rather than written, it can still be > > > > written once. This will prevent tampering while allowing nesting. > > > > (This can be implemented with an internal settable flag upon > > > > registration that does not get copied across a fork/clone.) > > > > > > > > Mimic setns(2) and return an error if the process has already initiated > > > > threading or forked since this registration should happen before the > > > > process execution is started by the orchestrator and hence should not > > > > yet have any threads or children. If this is deemed overly restrictive, > > > > switch all of the target's threads and children to the new containerID. > > > > > > > > Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL. > > > > > > > > When a container ceases to exist because the last process in that > > > > container has exited log the fact to balance the registration action. > > > > (This is likely needed for certification accountability.) 
> > > > > > > > At this point it appears unnecessary to add a container session > > > > identifier since this is all tracked from loginuid and sessionid to > > > > communicate with the container orchestrator to spawn an additional > > > > session into an existing container which would be logged. It can be > > > > added at a later date without breaking API should it be deemed > > > > necessary. > > > > > > > > The following namespace logging actions are not needed for certification > > > > purposes at this point, but are helpful for tracking namespace activity. > > > > These are auxilliary records that are associated with namespace > > > > manipulation syscalls unshare(2), clone(2) and setns(2), so the records > > > > will only show up if explicit syscall rules have been added to document > > > > this activity. > > > > > > > > Log the creation of every namespace, inheriting/adding its spawning > > > > process' audit container identifier(s), if applicable. Include the > > > > spawning and spawned namespace IDs (device and inode number tuples). > > > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > > > > Note: At this point it appears only network namespaces may need to track > > > > container IDs apart from processes since incoming packets may cause an > > > > auditable event before being associated with a process. Since a > > > > namespace can be shared by processes in different containers, the > > > > namespace will need to track all containers to which it has been > > > > assigned. > > > > > > > > Upon registration, the target process' namespace IDs (in the form of a > > > > nsfs device number and inode number tuple) will be recorded in an > > > > AUDIT_NS_INFO auxilliary record. > > > > > > > > Log the destruction of every namespace that is no longer used by any > > > > process, including the namespace IDs (device and inode number tuples). 
> > > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > > > > > > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > > > > the parent and child namespace IDs for any changes to a process' > > > > namespaces. [setns(2)] > > > > Note: It may be possible to combine AUDIT_NS_* record formats and > > > > distinguish them with an op=$action field depending on the fields > > > > required for each message type. > > > > > > > > The audit container identifier will need to be reaped from all > > > > implicated namespaces upon the destruction of a container. > > > > > > > > This namespace information adds supporting information for tracking > > > > events not attributable to specific processes. > > > > > > > > Changelog: > > > > > > > > (Upstream V3) > > > > - switch back to u64 (from pmoore, can be expanded to u128 in future if > > > > need arises without breaking API. u32 was originally proposed, up to > > > > c36 discussed) > > > > - write-once, but children inherit audit container identifier and can > > > > then still be written once > > > > - switch to CAP_AUDIT_CONTROL > > > > - group namespace actions together, auxilliary records to namespace > > > > operations. > > > > > > > > (Upstream V2) > > > > - switch from u64 to u128 UUID > > > > - switch from "signal" and "trigger" to "register" > > > > - restrict registration to single process or force all threads and > > > > children into same container > > > > > > I am trying to understand the back and forth on the ID size. > > > > > > From an orchestrator POV anything that requires tracking a node > > > specific ID is not ideal. > > > > > > Orchestrators tend to span many nodes, and containers tend to have IDs > > > that are either UUID or have a Hash (like SHA256) as identifier. > > > > > > The problem here is two-fold: > > > > > > a) Your auditing requires some mapping to be useful outside of the > > > system. 
> > > If you aggreggate audit logs outside of the system or you want to > > > correlate the system audit logs with other components dealing with > > > containers, now you need a place where you provide a mapping from your > > > audit u64 to the ID a container has in the rest of the system. > > > > > > b) Now you need a mapping of some sort. The simplest way a container > > > orchestrator can go about this is to just use the UUID or Hash > > > representing their view of the container, truncate it to a u64 and use > > > that for Audit. This means there are some chances there will be a > > > collision and a duplicate u64 ID will be used by the orchestrator as > > > the container ID. What happen in that case ? > > > > Paul, can you justify this somewhat larger inconvenience for some > > relatively minor convenience on our part? > > Done in direct response to Simo. Sorry but your response sounds more like waving away then addressing them, the excuse being: we can't please everyone, so we are going to please no one. > But to be clear Richard, we've talked about this a few times, it's not > a "minor convenience" on our part, it's a pretty big convenience once > we starting having to route audit events and make decisions based on > the audit container ID information. Audit performance is less than > awesome now, I'm working hard to not make it worse. Sounds like a security vs performance trade off to me. > > u64 vs u128 is easy for us to > > accomodate in terms of scalar comparisons. It doubles the information > > in every container id field we print in audit records. > > ... and slows down audit container ID checks. Are you saying a cmp on a u128 is slower than a comparison on a u64 and this is something that will be noticeable ? > > A c36 is a bigger step. > > Yeah, we're not doing that, no way. Ok, I can see your point though I do not agree with it. 
I can see why you do not want to have arbitrary length strings, but a u128 sounded like a reasonable compromise to me as it has enough room to be able to have unique cluster-wide IDs which a u64 definitely makes a lot harder to provide w/o tight coordination. Simo. -- Simo Sorce Sr. Principal Software Engineer Red Hat, Inc From paul at paul-moore.com Fri Feb 2 23:24:47 2018 From: paul at paul-moore.com (Paul Moore) Date: Fri, 2 Feb 2018 18:24:47 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: <1517609946.13097.161.camel@redhat.com> References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> <1517609946.13097.161.camel@redhat.com> Message-ID: On Fri, Feb 2, 2018 at 5:19 PM, Simo Sorce wrote: > On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote: >> On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote: >> > On 2018-01-09 11:18, Simo Sorce wrote: >> > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: ... >> > Paul, can you justify this somewhat larger inconvenience for some >> > relatively minor convenience on our part? >> >> Done in direct response to Simo. > > Sorry but your response sounds more like waving away then addressing > them, the excuse being: we can't please everyone, so we are going to > please no one. I obviously disagree with the take on my comments but you're free to your opinion. I believe saying we are pleasing no one isn't really fair now is it? Is there any type of audit container ID now? How would you go about associating audit events with containers now? (spoiler alert: it ain't pretty, and there are gaps I don't believe you can cover) This proposal provides a mechanism to do this in a way that isn't tied to any one particular concept of a container and is manageable inside the kernel. 
If you have a need to track audit events for containers, I find it extremely hard to believe that you are not at least partially pleased by the solutions presented here. It may not be everything on your wishlist, but when did you ever get *everything* on your wishlist? >> But to be clear Richard, we've talked about this a few times, it's not >> a "minor convenience" on our part, it's a pretty big convenience once >> we starting having to route audit events and make decisions based on >> the audit container ID information. Audit performance is less than >> awesome now, I'm working hard to not make it worse. > > Sounds like a security vs performance trade off to me. Welcome to software development. It's generally a pretty terrible hobby and/or occupation, but we make up for it with long hours and endless frustration. >> > u64 vs u128 is easy for us to >> > accomodate in terms of scalar comparisons. It doubles the information >> > in every container id field we print in audit records. >> >> ... and slows down audit container ID checks. > > Are you saying a cmp on a u128 is slower than a comparison on a u64 and > this is something that will be noticeable ? Do you have a 128 bit system? I don't. I've got a bunch of 64 bit systems, and a couple of 32 bit systems too. People that use audit have a tendency to really hammer on it, to the point that we get performance complaints on a not infrequent basis. I don't know the exact number of times we are going to need to check the audit container ID, but it's reasonable to think that we'll expose it as a filter-able field which adds a few checks, we'll use it for record routing so that's a few more, and if we're running multiple audit daemons we will probably want to include LSM checks which could result in a few more audit container ID checks. 
If it was one comparison I wouldn't be too worried about it, but the point I'm trying to make is that we don't know what the implementation is going to look like yet and I suspect this ID is going to be leveraged in several places in the audit subsystem and I would much rather start small to save headaches later. We can always expand the ID to a larger integer at a later date, but we can't make it smaller. >> > A c36 is a bigger step. >> >> Yeah, we're not doing that, no way. > > Ok, I can see your point though I do not agree with it. > > I can see why you do not want to have arbitrary length strings, but a > u128 sounded like a reasonable compromise to me as it has enough room > to be able to have unique cluster-wide IDs which a u64 definitely makes > a lot harder to provide w/o tight coordination. I originally wanted it to be a 32-bit integer, but Richard managed to talk me into 64-bits, that was my compromise :) As I said earlier, if you are doing container auditing you're going to need coordination with the orchestrator, regardless of the audit container ID size. -- paul moore www.paul-moore.com From serge at hallyn.com Sat Feb 3 01:57:21 2018 From: serge at hallyn.com (Serge E. Hallyn) Date: Fri, 2 Feb 2018 19:57:21 -0600 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> Message-ID: <20180203015721.GB27295@mail.hallyn.com> On Fri, Feb 02, 2018 at 05:05:22PM -0500, Paul Moore wrote: > On Tue, Jan 9, 2018 at 7:16 AM, Richard Guy Briggs wrote: > > Containers are a userspace concept. The kernel knows nothing of them. > > > > The Linux audit system needs a way to be able to track the container > > provenance of events and actions. Audit needs the kernel's help to do > > this. > > Two small comments below, but I tend to think we are at a point where > you can start cobbling together some prototype/RFC patches. Surely Agreed. LGTM. 
> there are going to be a few changes, and new comments, that come out > once we see an initial implementation so let's see what those are. thanks, -serge From casey at schaufler-ca.com Sat Feb 3 19:05:22 2018 From: casey at schaufler-ca.com (Casey Schaufler) Date: Sat, 3 Feb 2018 11:05:22 -0800 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> <1517609946.13097.161.camel@redhat.com> Message-ID: On 2/2/2018 3:24 PM, Paul Moore wrote: > On Fri, Feb 2, 2018 at 5:19 PM, Simo Sorce wrote: >> On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote: >>> On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote: >>>> On 2018-01-09 11:18, Simo Sorce wrote: >>>>> On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: > .. > >>>> Paul, can you justify this somewhat larger inconvenience for some >>>> relatively minor convenience on our part? >>> Done in direct response to Simo. >> Sorry but your response sounds more like waving away then addressing >> them, the excuse being: we can't please everyone, so we are going to >> please no one. > I obviously disagree with the take on my comments but you're free to > your opinion. > > I believe saying we are pleasing no one isn't really fair now is it? > Is there any type of audit container ID now? How would you go about > associating audit events with containers now? (spoiler alert: it ain't > pretty, and there are gaps I don't believe you can cover) This > proposal provides a mechanism to do this in a way that isn't tied to > any one particular concept of a container and is manageable inside the > kernel. > > If you have a need to track audit events for containers, I find it > extremely hard to believe that you are not at least partially pleased > by the solutions presented here. 
It may not be everything on your > wishlist, but when did you ever get *everything* on your wishlist? I am going to back Paul 100% on this point. The container community's emphatic position that containers are strictly a user-space construct makes it impossible for the kernel to provide any data more sophisticated than an integer, and any processing based on that data cleverer than a check for equality. >>> But to be clear Richard, we've talked about this a few times, it's not >>> a "minor convenience" on our part, it's a pretty big convenience once >>> we start having to route audit events and make decisions based on >>> the audit container ID information. Audit performance is less than >>> awesome now, I'm working hard to not make it worse. >> Sounds like a security vs performance trade off to me. Without the kernel having a "container" policy to work with there is no "security" it can possibly enforce. > Welcome to software development. It's generally a pretty terrible > hobby and/or occupation, but we make up for it with long hours and > endless frustration. > >>>> u64 vs u128 is easy for us to >>>> accommodate in terms of scalar comparisons. It doubles the information >>>> in every container id field we print in audit records. >>> ... and slows down audit container ID checks. >> Are you saying a cmp on a u128 is slower than a comparison on a u64 and >> this is something that will be noticeable? > Do you have a 128 bit system? I don't. I've got a bunch of 64 bit > systems, and a couple of 32 bit systems too. People that use audit > have a tendency to really hammer on it, to the point that we get > performance complaints on a not infrequent basis.
I don't know the > exact number of times we are going to need to check the audit > container ID, but it's reasonable to think that we'll expose it as a > filter-able field which adds a few checks, we'll use it for record > routing so that's a few more, and if we're running multiple audit > daemons we will probably want to include LSM checks which could result > in a few more audit container ID checks. If it was one comparison I > wouldn't be too worried about it, but the point I'm trying to make is > that we don't know what the implementation is going to look like yet > and I suspect this ID is going to be leveraged in several places in > the audit subsystem and I would much rather start small to save > headaches later. > > We can always expand the ID to a larger integer at a later date, but > we can't make it smaller. > >>>> A c36 is a bigger step. >>> Yeah, we're not doing that, no way. >> Ok, I can see your point though I do not agree with it. >> >> I can see why you do not want to have arbitrary length strings, but a >> u128 sounded like a reasonable compromise to me as it has enough room >> to be able to have unique cluster-wide IDs which a u64 definitely makes >> a lot harder to provide w/o tight coordination. > I originally wanted it to be a 32-bit integer, but Richard managed to > talk me into 64-bits, that was my compromise :) > > As I said earlier, if you are doing container auditing you're going to > need coordination with the orchestrator, regardless of the audit > container ID size. > From tycho at tycho.ws Sun Feb 4 10:49:43 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Sun, 4 Feb 2018 11:49:43 +0100 Subject: [RFC 0/3] seccomp trap to userspace Message-ID: <20180204104946.25559-1-tycho@tycho.ws> Several months ago at Linux Plumber's, we had a discussion about adding a feature to seccomp which would allow seccomp to trigger a notification for some other process. Here's a draft of that feature. 
Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to acquire the fd that receives notifications via ptrace (the method in patch 1 poses some problems). Other suggestions for how to acquire one of these fds would be welcome. Take a close look at the synchronization. I think I've got it right, but I probably don't :) Thanks! Tycho Andersen (3): seccomp: add a return code to trap to userspace seccomp: hoist out filter resolving logic seccomp: add a way to get a listener fd from ptrace arch/Kconfig | 7 + include/linux/seccomp.h | 14 +- include/uapi/linux/ptrace.h | 1 + include/uapi/linux/seccomp.h | 18 +- kernel/ptrace.c | 4 + kernel/seccomp.c | 467 ++++++++++++++++++++++++-- tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++- 7 files changed, 653 insertions(+), 38 deletions(-) -- 2.14.1 From tycho at tycho.ws Sun Feb 4 10:49:45 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Sun, 4 Feb 2018 11:49:45 +0100 Subject: [RFC 2/3] seccomp: hoist out filter resolving logic In-Reply-To: <20180204104946.25559-1-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> Message-ID: <20180204104946.25559-3-tycho@tycho.ws> Hoist out the nth filter resolving logic that ptrace uses into a new function. We'll use this in the next patch to implement the new PTRACE_SECCOMP_GET_FILTER_FLAGS command. This is based on an older patch that I had sent a while ago; it significantly revamps the get_nth_filter logic based on previous suggestions from Oleg. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. 
Hallyn" CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda --- kernel/seccomp.c | 77 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 45 insertions(+), 32 deletions(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 9541eb379e74..800db3f2866f 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1179,49 +1179,68 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) } #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE) -long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, - void __user *data) +static struct seccomp_filter *get_nth_filter(struct task_struct *task, + unsigned long filter_off) { - struct seccomp_filter *filter; - struct sock_fprog_kern *fprog; - long ret; - unsigned long count = 0; - - if (!capable(CAP_SYS_ADMIN) || - current->seccomp.mode != SECCOMP_MODE_DISABLED) { - return -EACCES; - } + struct seccomp_filter *orig, *filter; + unsigned long count; + /* + * Note: this is only correct because the caller should be the (ptrace) + * tracer of the task, otherwise lock_task_sighand is needed. + */ spin_lock_irq(&task->sighand->siglock); + if (task->seccomp.mode != SECCOMP_MODE_FILTER) { - ret = -EINVAL; - goto out; + spin_unlock_irq(&task->sighand->siglock); + return ERR_PTR(-EINVAL); } - filter = task->seccomp.filter; - while (filter) { - filter = filter->prev; + orig = task->seccomp.filter; + __get_seccomp_filter(orig); + spin_unlock_irq(&task->sighand->siglock); + + count = 0; + for (filter = orig; filter; filter = filter->prev) count++; - } if (filter_off >= count) { - ret = -ENOENT; + filter = ERR_PTR(-ENOENT); goto out; } - count -= filter_off; - filter = task->seccomp.filter; - while (filter && count > 1) { - filter = filter->prev; + count -= filter_off; + for (filter = orig; filter && count > 1; filter = filter->prev) count--; - } if (WARN_ON(count != 1 || !filter)) { - /* The filter tree shouldn't shrink while we're using it. 
*/ - ret = -ENOENT; + filter = ERR_PTR(-ENOENT); goto out; } + __get_seccomp_filter(filter); + +out: + __put_seccomp_filter(orig); + return filter; +} + +long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, + void __user *data) +{ + struct seccomp_filter *filter; + struct sock_fprog_kern *fprog; + long ret; + + if (!capable(CAP_SYS_ADMIN) || + current->seccomp.mode != SECCOMP_MODE_DISABLED) { + return -EACCES; + } + + filter = get_nth_filter(task, filter_off); + if (IS_ERR(filter)) + return PTR_ERR(filter); + fprog = filter->prog->orig_prog; if (!fprog) { /* This must be a new non-cBPF filter, since we save @@ -1236,17 +1255,11 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, if (!data) goto out; - __get_seccomp_filter(filter); - spin_unlock_irq(&task->sighand->siglock); - if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog))) ret = -EFAULT; - __put_seccomp_filter(filter); - return ret; - out: - spin_unlock_irq(&task->sighand->siglock); + __put_seccomp_filter(filter); return ret; } #endif -- 2.14.1 From tycho at tycho.ws Sun Feb 4 10:49:46 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Sun, 4 Feb 2018 11:49:46 +0100 Subject: [RFC 3/3] seccomp: add a way to get a listener fd from ptrace In-Reply-To: <20180204104946.25559-1-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> Message-ID: <20180204104946.25559-4-tycho@tycho.ws> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() version which can acquire filters is useful. There are at least two reasons this is preferable, even though it uses ptrace: 1. You can control tasks that aren't cooperating with you 2. You can control tasks whose filters block sendmsg() and socket(); if the task installs a filter which blocks these calls, there's no way with SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. 
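For readers who want to see the tracer side of this, here is a minimal sketch of how a privileged task might fetch a listener fd with the request proposed above. It mirrors the attach, wait, request, detach sequence the selftest in this patch uses. The helper name get_listener_via_ptrace() is hypothetical, and PTRACE_SECCOMP_GET_LISTENER (0x420d) exists only with this RFC applied; on other kernels that request number may be unassigned or mean something else, so treat this as an illustration rather than something to run against an arbitrary kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Request number proposed by this RFC; not part of released uapi headers. */
#ifndef PTRACE_SECCOMP_GET_LISTENER
#define PTRACE_SECCOMP_GET_LISTENER 0x420d
#endif

/*
 * Hypothetical helper: attach to pid, ask for a listener fd on the filter
 * at index filter_off, then detach again.
 * Returns the new fd (>= 0) on success or a negative errno on failure.
 */
int get_listener_via_ptrace(pid_t pid, unsigned long filter_off)
{
	long fd;
	int saved;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
		return -errno;

	/* PTRACE_ATTACH stops the tracee; wait until that stop is reported. */
	if (waitpid(pid, NULL, 0) != pid) {
		saved = errno;
		ptrace(PTRACE_DETACH, pid, NULL, NULL);
		return -saved;
	}

	/* The filter offset travels in the addr argument, as in the patch. */
	fd = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, (void *)filter_off, NULL);
	saved = errno;

	/* Detach regardless of the outcome so the tracee can continue. */
	ptrace(PTRACE_DETACH, pid, NULL, NULL);

	return fd < 0 ? -saved : (int)fd;
}
```

With the RFC applied, a second call for the same filter would be expected to fail with -EBUSY, since the patch allows only one listener per filter.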
Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda --- include/linux/seccomp.h | 11 +++++ include/uapi/linux/ptrace.h | 1 + kernel/ptrace.c | 4 ++ kernel/seccomp.c | 24 ++++++++++ tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++++++++++ 5 files changed, 106 insertions(+) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index ce07da2ffd53..0d4750e04bb1 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -103,4 +103,15 @@ static inline long seccomp_get_filter(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +extern long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off); +#else +static inline long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off) +{ + return -EINVAL; +} +#endif/* CONFIG_SECCOMP_USER_NOTIFICATION */ #endif /* _LINUX_SECCOMP_H */ diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index e3939e00980b..60113de59b04 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -66,6 +66,7 @@ struct ptrace_peeksiginfo_args { #define PTRACE_SETSIGMASK 0x420b #define PTRACE_SECCOMP_GET_FILTER 0x420c +#define PTRACE_SECCOMP_GET_LISTENER 0x420d /* Read signals from a shared (process wide) queue */ #define PTRACE_PEEKSIGINFO_SHARED (1 << 0) diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 84b1367935e4..50d8cc8be054 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -1092,6 +1092,10 @@ int ptrace_request(struct task_struct *child, long request, ret = seccomp_get_filter(child, addr, datavp); break; + case PTRACE_SECCOMP_GET_LISTENER: + ret = seccomp_get_listener(child, addr); + break; + default: break; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 800db3f2866f..0b1f65273d2a 100644 
--- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1605,4 +1605,28 @@ static struct file *init_listener(struct seccomp_filter *filter) mutex_unlock(&filter->notify_lock); return ret; } + +long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off) +{ + struct seccomp_filter *filter; + struct file *listener; + int fd; + + filter = get_nth_filter(task, filter_off); + if (IS_ERR(filter)) + return PTR_ERR(filter); + + listener = init_listener(filter); + if (IS_ERR(listener)) + return PTR_ERR(listener); + + fd = get_unused_fd_flags(O_RDWR); + if (fd < 0) + put_filp(listener); + else + fd_install(fd, listener); + + return fd; +} #endif diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index b43e2a70b08c..80f89a766895 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -168,6 +168,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args) } #endif +#ifndef PTRACE_SECCOMP_GET_LISTENER +#define PTRACE_SECCOMP_GET_LISTENER 0x420d +#endif + #if __BYTE_ORDER == __LITTLE_ENDIAN #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n])) #elif __BYTE_ORDER == __BIG_ENDIAN @@ -2957,6 +2961,68 @@ TEST(get_user_notification_syscall) close(listener); } +TEST(get_user_notification_ptrace) +{ + pid_t pid; + int status, listener; + int sk_pair[2]; + char c; + struct seccomp_notif req; + struct seccomp_notif_resp resp; + + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0); + + /* Test that we get ENOSYS while not attached */ + ASSERT_EQ(syscall(__NR_getpid), -1); + ASSERT_EQ(errno, ENOSYS); + + /* Signal we're ready and have installed the filter. 
*/ + ASSERT_EQ(write(sk_pair[1], "J", 1), 1); + + ASSERT_EQ(read(sk_pair[1], &c, 1), 1); + ASSERT_EQ(c, 'H'); + + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); + } + + ASSERT_EQ(read(sk_pair[0], &c, 1), 1); + ASSERT_EQ(c, 'J'); + + ASSERT_EQ(ptrace(PTRACE_ATTACH, pid), 0); + ASSERT_EQ(waitpid(pid, NULL, 0), pid); + listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0); + ASSERT_GE(listener, 0); + + /* EBUSY for second listener */ + ASSERT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1); + ASSERT_EQ(errno, EBUSY); + + ASSERT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0); + + /* Now signal we are done and respond with magic */ + ASSERT_EQ(write(sk_pair[0], "H", 1), 1); + + ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); + + ASSERT_EQ(waitpid(pid, &status, 0), pid); + ASSERT_EQ(true, WIFEXITED(status)); + ASSERT_EQ(0, WEXITSTATUS(status)); + + close(listener); +} + /* * TODO: * - add microbenchmarks -- 2.14.1 From tycho at tycho.ws Sun Feb 4 10:49:44 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Sun, 4 Feb 2018 11:49:44 +0100 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: <20180204104946.25559-1-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> Message-ID: <20180204104946.25559-2-tycho@tycho.ws> This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered. The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host. As another example, containers cannot mknod(), since this checks capable(CAP_SYS_ADMIN). 
However, harmless devices like /dev/null or /dev/zero should be ok for containers to mknod, but we'd like to avoid hard coding some whitelist in the kernel. Another example is mount(), which has many security restrictions for good reason, but configuration or runtime knowledge could potentially be used to relax these restrictions. This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since upstart itself uses ptrace to start services. The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex. Also worth noting that there is one race still present: 1. a task does a SECCOMP_RET_USER_NOTIF 2. the userspace handler reads this notification 3. the task dies 4. a new task with the same pid starts 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id that the previous one did 6. the userspace handler writes a response There's no way to distinguish this case right now. Maybe we care, maybe we don't, but it's worth noting. Right now the interface is a simple structure copy across a file descriptor. We could potentially invent something fancier. 
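To make the "simple structure copy across a file descriptor" concrete, here is a minimal sketch of one iteration of a userspace handler. The struct layouts are local copies of what this RFC proposes (the _rfc suffix is added here only to avoid clashing with real headers), and handle_one_notification(), notif_policy_fn and deny_with_eperm() are hypothetical names. It assumes the error field carries a negative errno, as the -ENOSYS path in the patch suggests, and it does nothing about the pid reuse race described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>	/* only needed when testing against a socketpair */
#include <sys/types.h>
#include <unistd.h>

/* Local copies of the layouts proposed in this RFC; names are illustrative. */
struct seccomp_data_rfc {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

struct seccomp_notif_rfc {
	uint32_t id;
	pid_t pid;
	struct seccomp_data_rfc data;
};

struct seccomp_notif_resp_rfc {
	uint32_t id;
	int error;
	long val;
};

/* A policy decides the syscall's return value and error for one request. */
typedef int (*notif_policy_fn)(const struct seccomp_notif_rfc *req, long *val);

/* Example policy: refuse everything with EPERM. */
int deny_with_eperm(const struct seccomp_notif_rfc *req, long *val)
{
	(void)req;
	*val = 0;
	return -EPERM;
}

/*
 * Read one notification from the listener fd, run the policy, and write the
 * response back. Returns 0 on success or a negative errno.
 */
int handle_one_notification(int listener, notif_policy_fn policy)
{
	struct seccomp_notif_rfc req;
	struct seccomp_notif_resp_rfc resp;
	ssize_t n;

	assert(policy != NULL);

	n = read(listener, &req, sizeof(req));
	if (n < 0)
		return -errno;
	if ((size_t)n != sizeof(req))
		return -EIO;

	memset(&resp, 0, sizeof(resp));
	resp.id = req.id;	/* the cookie ties the response to the request */
	resp.error = policy(&req, &resp.val);

	n = write(listener, &resp, sizeof(resp));
	if (n < 0)
		return -errno;
	return (size_t)n == sizeof(resp) ? 0 : -EIO;
}
```

A real handler would call this in a loop on the fd returned by SECCOMP_FILTER_FLAG_GET_LISTENER, and would copy any task memory it needs before applying its policy.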
Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer. Signed-off-by: Tycho Andersen CC: Kees Cook CC: Andy Lutomirski CC: Oleg Nesterov CC: Eric W. Biederman CC: "Serge E. Hallyn" CC: Christian Brauner CC: Tyler Hicks CC: Akihiro Suda --- arch/Kconfig | 7 + include/linux/seccomp.h | 3 +- include/uapi/linux/seccomp.h | 18 +- kernel/seccomp.c | 366 +++++++++++++++++++++++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++- 5 files changed, 502 insertions(+), 6 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 400b9e1b2f27..2946cb6fd704 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -387,6 +387,13 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_USER_NOTIFICATION + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action" + depends on SECCOMP_FILTER + help + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp + programs to notify a userspace listener that a particular event happened. 
+ config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 10f25f7e4304..ce07da2ffd53 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -5,7 +5,8 @@ #include #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG) + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_GET_LISTENER) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..4a342aa2e524 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -17,8 +17,9 @@ #define SECCOMP_GET_ACTION_AVAIL 2 /* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC 1 -#define SECCOMP_FILTER_FLAG_LOG 2 +#define SECCOMP_FILTER_FLAG_TSYNC 1 +#define SECCOMP_FILTER_FLAG_LOG 2 +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 /* * All BPF programs must return a 32-bit value. @@ -34,6 +35,7 @@ #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */ #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */ @@ -59,4 +61,16 @@ struct seccomp_data { __u64 args[6]; }; +struct seccomp_notif { + __u32 id; + pid_t pid; + struct seccomp_data data; +}; + +struct seccomp_notif_resp { + __u32 id; + int error; + long val; +}; + #endif /* _UAPI_LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 5f0dfb2abb8d..9541eb379e74 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -38,6 +38,52 @@ #include #include +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +#include +#include + +enum notify_state { + SECCOMP_NOTIFY_INIT, + SECCOMP_NOTIFY_READ, + SECCOMP_NOTIFY_WRITE, +}; + +struct seccomp_knotif { + /* The pid 
whose filter triggered the notification */ + pid_t pid; + + /* + * The "cookie" for this request; this is unique for this filter. + */ + u32 id; + + /* + * The seccomp data. This pointer is valid the entire time this + * notification is active, since it comes from __seccomp_filter which + * eclipses the entire lifecycle here. + */ + const struct seccomp_data *data; + + /* + * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not + * yet been sent to userspace + * SECCOMP_NOTIFY_READ: sent to userspace but no response yet + * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has + * not yet been written back to the application + */ + enum notify_state state; + + /* The return values, only valid when in SECCOMP_NOTIFY_WRITE */ + int error; + long val; + + /* Signals when this has entered SECCOMP_NOTIFY_WRITE */ + struct completion ready; + + struct list_head list; +}; +#endif + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -64,6 +110,30 @@ struct seccomp_filter { bool log; struct seccomp_filter *prev; struct bpf_prog *prog; + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION + /* + * A semaphore that users of this notification can wait on for + * changes. Actual reads and writes are still controlled with + * filter->notify_lock. + */ + struct semaphore request; + + /* + * A lock for all notification-related accesses. + */ + struct mutex notify_lock; + + /* + * Is there currently an attached listener? + */ + bool has_listener; + + /* + * A list of struct seccomp_knotif elements. + */ + struct list_head notifications; +#endif }; /* Limit any path through the tree to 256KB worth of instructions. 
*/ @@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) if (!sfilter) return ERR_PTR(-ENOMEM); +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION + mutex_init(&sfilter->notify_lock); + sema_init(&sfilter->request, 0); + INIT_LIST_HEAD(&sfilter->notifications); +#endif + ret = bpf_prog_create_from_user(&sfilter->prog, fprog, seccomp_check_filter, save_orig); if (ret < 0) { @@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason) #define SECCOMP_LOG_TRACE (1 << 4) #define SECCOMP_LOG_LOG (1 << 5) #define SECCOMP_LOG_ALLOW (1 << 6) +#define SECCOMP_LOG_USER_NOTIF (1 << 7) static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS | SECCOMP_LOG_KILL_THREAD | SECCOMP_LOG_TRAP | SECCOMP_LOG_ERRNO | SECCOMP_LOG_TRACE | - SECCOMP_LOG_LOG; + SECCOMP_LOG_LOG | + SECCOMP_LOG_USER_NOTIF; static inline void seccomp_log(unsigned long syscall, long signr, u32 action, bool requested) @@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action, case SECCOMP_RET_TRACE: log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE; break; + case SECCOMP_RET_USER_NOTIF: + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF; + break; case SECCOMP_RET_LOG: log = seccomp_actions_logged & SECCOMP_LOG_LOG; break; @@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall) } #else +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +/* + * Finds the next unique notification id. 
+ */ +static u32 seccomp_next_notify_id(struct list_head *list) +{ + struct seccomp_knotif *knotif = NULL; + struct list_head *cur; + u32 id = get_random_u32(); + +again: + list_for_each(cur, list) { + knotif = list_entry(cur, struct seccomp_knotif, list); + + if (knotif->id == id) { + id = get_random_u32(); + goto again; + } + } + + return id; +} + +static void seccomp_do_user_notification(int this_syscall, + struct seccomp_filter *match, + const struct seccomp_data *sd) +{ + int err; + long ret = 0; + struct seccomp_knotif n = {}; + + mutex_lock(&match->notify_lock); + if (!match->has_listener) { + err = -ENOSYS; + goto out; + } + + n.pid = current->pid; + n.state = SECCOMP_NOTIFY_INIT; + n.data = sd; + n.id = seccomp_next_notify_id(&match->notifications); + init_completion(&n.ready); + + list_add(&n.list, &match->notifications); + + mutex_unlock(&match->notify_lock); + up(&match->request); + + err = wait_for_completion_interruptible(&n.ready); + /* + * This syscall is getting interrupted. We no longer need to + * tell userspace about it, and any userspace responses should + * be ignored. 
+ */ + mutex_lock(&match->notify_lock); + if (err < 0) + goto remove_list; + + ret = n.val; + err = n.error; + + WARN(n.state != SECCOMP_NOTIFY_WRITE, + "notified about write complete when state is not write"); + +remove_list: + list_del(&n.list); +out: + mutex_unlock(&match->notify_lock); + syscall_set_return_value(current, task_pt_regs(current), + err, ret); +} +#else +static void seccomp_do_user_notification(int this_syscall, + u32 action, + struct seccomp_filter *match, + const struct seccomp_data *sd) +{ + WARN(1, "user notification received, but disabled"); + seccomp_log(this_syscall, SIGSYS, action, true); + do_exit(SIGSYS); +} +#endif + #ifdef CONFIG_SECCOMP_FILTER static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, const bool recheck_after_trace) @@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, return 0; + case SECCOMP_RET_USER_NOTIF: + seccomp_do_user_notification(this_syscall, match, sd); + goto skip; case SECCOMP_RET_LOG: seccomp_log(this_syscall, 0, action, true); return 0; @@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void) } #ifdef CONFIG_SECCOMP_FILTER +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +static struct file *init_listener(struct seccomp_filter *filter); +#endif + /** * seccomp_set_mode_filter: internal function for setting seccomp filter * @flags: flags to change filter behavior @@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags, const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; + int listener = 0; + struct file *listener_f = NULL; /* Validate flags. 
*/ if (flags & ~SECCOMP_FILTER_FLAG_MASK) @@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags, if (IS_ERR(prepared)) return PTR_ERR(prepared); + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { + listener = get_unused_fd_flags(O_RDWR); + if (listener < 0) { + ret = listener; + goto out_free; + } + + listener_f = init_listener(prepared); + if (IS_ERR(listener_f)) { + put_unused_fd(listener); + ret = PTR_ERR(listener_f); + goto out_free; + } + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ if (flags & SECCOMP_FILTER_FLAG_TSYNC && mutex_lock_killable(&current->signal->cred_guard_mutex)) - goto out_free; + goto out_put_fd; spin_lock_irq(&current->sighand->siglock); @@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) mutex_unlock(&current->signal->cred_guard_mutex); +out_put_fd: + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { + if (ret < 0) { + fput(listener_f); + put_unused_fd(listener); + } else { + fd_install(listener, listener_f); + ret = listener; + } + } out_free: seccomp_filter_free(prepared); return ret; @@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction) case SECCOMP_RET_LOG: case SECCOMP_RET_ALLOW: break; + case SECCOMP_RET_USER_NOTIF: + if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION)) + break; default: return -EOPNOTSUPP; } @@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread" #define SECCOMP_RET_TRAP_NAME "trap" #define SECCOMP_RET_ERRNO_NAME "errno" +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif" #define SECCOMP_RET_TRACE_NAME "trace" #define SECCOMP_RET_LOG_NAME "log" #define SECCOMP_RET_ALLOW_NAME "allow" @@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] = SECCOMP_RET_KILL_THREAD_NAME " " SECCOMP_RET_TRAP_NAME " "
SECCOMP_RET_ERRNO_NAME " " + SECCOMP_RET_USER_NOTIF_NAME " " SECCOMP_RET_TRACE_NAME " " SECCOMP_RET_LOG_NAME " " SECCOMP_RET_ALLOW_NAME; @@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = { { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME }, { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME }, { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME }, + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME }, { } }; @@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +static int seccomp_notify_release(struct inode *inode, struct file *file) +{ + struct seccomp_filter *filter = file->private_data; + struct list_head *cur; + + mutex_lock(&filter->notify_lock); + + /* + * If this file is being closed because e.g. the task who owned it + * died, let's wake everyone up who was waiting on us. + */ + list_for_each(cur, &filter->notifications) { + struct seccomp_knotif *knotif; + + knotif = list_entry(cur, struct seccomp_knotif, list); + + knotif->state = SECCOMP_NOTIFY_WRITE; + knotif->error = -ENOSYS; + knotif->val = 0; + complete(&knotif->ready); + } + + filter->has_listener = false; + mutex_unlock(&filter->notify_lock); + __put_seccomp_filter(filter); + return 0; +} + +static ssize_t seccomp_notify_read(struct file *f, char __user *buf, + size_t size, loff_t *ppos) +{ + struct seccomp_filter *filter = f->private_data; + struct seccomp_knotif *knotif = NULL; + struct seccomp_notif unotif; + struct list_head *cur; + ssize_t ret; + + /* No offset reads. 
*/
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+		if (knotif->state == SECCOMP_NOTIFY_INIT)
+			break;
+	}
+
+	/*
+	 * We didn't find anything which is odd, because at least one
+	 * thing should have been queued.
+	 */
+	if (knotif->state != SECCOMP_NOTIFY_INIT) {
+		ret = -ENOENT;
+		WARN(1, "no seccomp notification found");
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_READ;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	ssize_t ret = -EINVAL;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_WRITE;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	/* TODO: poll */
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct seccomp_filter *filter)
+{
+	struct file *ret;
+
+	mutex_lock(&filter->notify_lock);
+	if (filter->has_listener) {
+		mutex_unlock(&filter->notify_lock);
+		return ERR_PTR(-EBUSY);
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret)) {
+		__put_seccomp_filter(filter);
+	} else {
+		/*
+		 * Intentionally don't put_seccomp_filter(). The file
+		 * has a reference to it now.
+		 */
+		filter->has_listener = true;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 24dbf634e2dd..b43e2a70b08c 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 
 #define _GNU_SOURCE
 #include
@@ -141,6 +142,24 @@ struct seccomp_data {
 #define SECCOMP_FILTER_FLAG_LOG 2
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2845,6 +2865,98 @@ TEST(get_action_avail)
 	EXPECT_EQ(errno, EOPNOTSUPP);
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	ASSERT_EQ(ret, sizeof(req));
+
+	ASSERT_EQ(kill(pid, SIGKILL), 0);
+	ASSERT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.14.1

From luto at amacapital.net Sun Feb 4 17:36:33 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Sun, 4 Feb 2018 17:36:33 +0000 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: <20180204104946.25559-2-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> Message-ID: On Sun, Feb 4, 2018 at 10:49 AM, Tycho Andersen wrote: > This patch introduces a means for syscalls matched in seccomp to notify > some other task that a particular filter has been triggered. Neat! > > The motivation for this is primarily for use with containers. For example, > if a container does an init_module(), we obviously don't want to load this > untrusted code, which may be compiled for the wrong version of the kernel > anyway. Instead, we could parse the module image, figure out which module > the container is trying to load and load it on the host. > > As another example, containers cannot mknod(), since this checks > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > coding some whitelist in the kernel. Another example is mount(), which has > many security restrictions for good reason, but configuration or runtime > knowledge could potentially be used to relax these restrictions.
> > This patch adds functionality that is already possible via at least two > other means that I know about, both of which involve ptrace(): first, one > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. > Unfortunately this is slow, so a faster version would be to install a > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. > Since ptrace allows only one tracer, if the container runtime is that > tracer, users inside the container (or outside) trying to debug it will not > be able to use ptrace, which is annoying. It also means that older > distributions based on Upstart cannot boot inside containers using ptrace, > since upstart itself uses ptrace to start services. > > The actual implementation of this is fairly small, although getting the > synchronization right was/is slightly complex. Also worth noting that there > is one race still present: > > 1. a task does a SECCOMP_RET_USER_NOTIF > 2. the userspace handler reads this notification > 3. the task dies > 4. a new task with the same pid starts > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > that the previous one did > 6. the userspace handler writes a response I'm slightly confused. I thought the id was never reused for a given struct seccomp_filter. (Also, shouldn't the id be u64, not u32?) On very quick reading, I have a question. What happens if a process has two seccomp_filters attached, one of them returns SECCOMP_RET_USER_NOTIF, and the *other* one has a listener? 
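
The id-width question above can be made concrete with a small sketch. The following C fragment is not taken from the patch: next_id_u32(), next_id_u64() and seconds_until_wrap() are hypothetical names used only for illustration, and the model assumes the cookie comes from a simple per-filter counter that is incremented once per notification.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a per-filter cookie allocator: hand out the
 * current counter value and advance it. Unsigned arithmetic wraps
 * silently, so a 32-bit counter eventually repeats old ids.
 */
static uint32_t next_id_u32(uint32_t *counter)
{
	return (*counter)++;
}

static uint64_t next_id_u64(uint64_t *counter)
{
	return (*counter)++;
}

/*
 * Seconds until a counter of the given width has handed out every
 * value once, assuming a steady rate of per_second notifications.
 */
static double seconds_until_wrap(unsigned int bits, double per_second)
{
	double ids = 1.0;
	unsigned int i;

	assert(bits >= 1 && bits <= 64 && per_second > 0.0);
	for (i = 0; i < bits; i++)
		ids *= 2.0;	/* powers of two are exact in a double */
	return ids / per_second;
}
```

Under these assumptions, seconds_until_wrap(32, 1e6) is about 4295 seconds, i.e. a 32-bit id would repeat after roughly 72 minutes at one million notifications per second, while the 64-bit variant keeps counting past the point where the 32-bit one has returned to zero.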
From tycho at tycho.ws Sun Feb 4 20:01:29 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Sun, 4 Feb 2018 21:01:29 +0100 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> Message-ID: <20180204200129.2bgq5yfaimg6xdg5@cisco> Hi Andy, On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote: > > The actual implementation of this is fairly small, although getting the > > synchronization right was/is slightly complex. Also worth noting that there > > is one race still present: > > > > 1. a task does a SECCOMP_RET_USER_NOTIF > > 2. the userspace handler reads this notification > > 3. the task dies > > 4. a new task with the same pid starts > > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > > that the previous one did > > 6. the userspace handler writes a response > > I'm slightly confused. I thought the id was never reused for a given > struct seccomp_filter. (Also, shouldn't the id be u64, not u32?) Well, what happens when u32/64 overflows? Eventually it will wrap. > On very quick reading, I have a question. What happens if a process > has two seccomp_filters attached, one of them returns > SECCOMP_RET_USER_NOTIF, and the *other* one has a listener? Good question, in seccomp_run_filters(), the first (lowest, last applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that gets the notification and the other receives nothing. I don't really have any reason to prefer this behavior, it's just what happened without much thought. 
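
The selection rule described above can be illustrated with a toy model. This is only a sketch, not the kernel's seccomp_run_filters(): struct toy_filter, toy_pick_notifier() and the TOY_* constants are hypothetical, and the model assumes filters are walked from the most recently installed one towards the oldest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the actions a filter can return. */
enum toy_action { TOY_ALLOW, TOY_USER_NOTIF };

/* Hypothetical, simplified view of one installed filter. */
struct toy_filter {
	struct toy_filter *prev;	/* filter installed before this one */
	enum toy_action action;		/* result for the syscall in question */
	int has_listener;		/* was a listener fd requested? */
};

/*
 * Walk from the newest filter to the oldest and return the first one
 * that asks for a user notification, whether or not it has a listener.
 */
static struct toy_filter *toy_pick_notifier(struct toy_filter *newest)
{
	struct toy_filter *f;

	for (f = newest; f != NULL; f = f->prev)
		if (f->action == TOY_USER_NOTIF)
			return f;
	return NULL;
}

/* The older filter has the listener, but the newer one is picked. */
static void toy_demo(void)
{
	struct toy_filter older = { NULL, TOY_USER_NOTIF, 1 };
	struct toy_filter newer = { &older, TOY_USER_NOTIF, 0 };

	assert(toy_pick_notifier(&newer) == &newer);
	assert(!toy_pick_notifier(&newer)->has_listener);
}
```

In this model the older filter's listener never hears about the syscall, which is the nesting problem the question points at.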
Cheers, Tycho From luto at kernel.org Sun Feb 4 20:33:25 2018 From: luto at kernel.org (Andy Lutomirski) Date: Sun, 4 Feb 2018 20:33:25 +0000 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: <20180204200129.2bgq5yfaimg6xdg5@cisco> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180204200129.2bgq5yfaimg6xdg5@cisco> Message-ID: On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen wrote: > Hi Andy, > > On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote: >> > The actual implementation of this is fairly small, although getting the >> > synchronization right was/is slightly complex. Also worth noting that there >> > is one race still present: >> > >> > 1. a task does a SECCOMP_RET_USER_NOTIF >> > 2. the userspace handler reads this notification >> > 3. the task dies >> > 4. a new task with the same pid starts >> > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id >> > that the previous one did >> > 6. the userspace handler writes a response >> >> I'm slightly confused. I thought the id was never reused for a given >> struct seccomp_filter. (Also, shouldn't the id be u64, not u32?) > > Well, what happens when u32/64 overflows? Eventually it will wrap. I think we can safely assume that u64 won't overflow. Even if we processed one user return notification on a given seccomp_filter every nanosecond (which would be insanely fast), that's 584 years. > >> On very quick reading, I have a question. What happens if a process >> has two seccomp_filters attached, one of them returns >> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener? > > Good question, in seccomp_run_filters(), the first (lowest, last > applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that > gets the notification and the other receives nothing. > > I don't really have any reason to prefer this behavior, it's just what > happened without much thought. Hmm. This won't nest right. 
Maybe we should just disallow a user-notification-using filter from being applied if there is already one in the stack. Then, if anyone cares about making these things nest right, they can fix it. From tycho at tycho.ws Mon Feb 5 08:47:36 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Mon, 5 Feb 2018 09:47:36 +0100 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180204200129.2bgq5yfaimg6xdg5@cisco> Message-ID: <20180205084736.biqc4mflczsix6wm@cisco> On Sun, Feb 04, 2018 at 08:33:25PM +0000, Andy Lutomirski wrote: > On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen wrote: > > Hi Andy, > > > > On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote: > >> > The actual implementation of this is fairly small, although getting the > >> > synchronization right was/is slightly complex. Also worth noting that there > >> > is one race still present: > >> > > >> > 1. a task does a SECCOMP_RET_USER_NOTIF > >> > 2. the userspace handler reads this notification > >> > 3. the task dies > >> > 4. a new task with the same pid starts > >> > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > >> > that the previous one did > >> > 6. the userspace handler writes a response > >> > >> I'm slightly confused. I thought the id was never reused for a given > >> struct seccomp_filter. (Also, shouldn't the id be u64, not u32?) > > > > Well, what happens when u32/64 overflows? Eventually it will wrap. > > I think we can safely assume that u64 won't overflow. Even if we > processed one user return notification on a given seccomp_filter every > nanosecond (which would be insanely fast), that's 584 years. Yes, fair point r.e. u64. I'll make the change. > > > >> On very quick reading, I have a question. 
What happens if a process > >> has two seccomp_filters attached, one of them returns > >> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener? > > > > Good question, in seccomp_run_filters(), the first (lowest, last > > applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that > > gets the notification and the other receives nothing. > > > > I don't really have any reason to prefer this behavior, it's just what > > happened without much thought. > > Hmm. This won't nest right. Maybe we should just disallow a > user-notification-using filter from being applied if there is already > one in the stack. Then, if anyone cares about making these things > nest right, they can fix it. Sounds fine to me, I'll add a check. Cheers, Tycho From simo at redhat.com Mon Feb 5 13:47:36 2018 From: simo at redhat.com (Simo Sorce) Date: Mon, 05 Feb 2018 08:47:36 -0500 Subject: RFC(V3): Audit Kernel Container IDs In-Reply-To: References: <20180109121620.wi7dq2423ugsraqv@madcap2.tricolour.ca> <1515514736.3239.10.camel@redhat.com> <20180110070011.l4rcdcwb27witfem@madcap2.tricolour.ca> <1517609946.13097.161.camel@redhat.com> Message-ID: <1517838456.13097.163.camel@redhat.com> On Fri, 2018-02-02 at 18:24 -0500, Paul Moore wrote: > On Fri, Feb 2, 2018 at 5:19 PM, Simo Sorce wrote: > > On Fri, 2018-02-02 at 16:24 -0500, Paul Moore wrote: > > > On Wed, Jan 10, 2018 at 2:00 AM, Richard Guy Briggs wrote: > > > > On 2018-01-09 11:18, Simo Sorce wrote: > > > > > On Tue, 2018-01-09 at 07:16 -0500, Richard Guy Briggs wrote: > > ... > > > > > Paul, can you justify this somewhat larger inconvenience for some > > > > relatively minor convenience on our part? > > > > > > Done in direct response to Simo. > > > > Sorry but your response sounds more like waving away then addressing > > them, the excuse being: we can't please everyone, so we are going to > > please no one. > > I obviously disagree with the take on my comments but you're free to > your opinion. 
Then I misunderstood your comments, I am not interested in putting words in your mouth. > I believe saying we are pleasing no one isn't really fair now is it? Well, of course you are going to please the audit subsystem, I understand that. I think there is a problem of expectations. Some people, me included, hoped to have a way to identify a container with the help of the kernel. > Is there any type of audit container ID now? How would you go about > associating audit events with containers now? We do not have a good way, there are some dirty tricks like inferring the container identity via cgroup names, but that is ... eww. This is why, given audit has the same need of user space, there was some hope we could agree on an identifier that could be used by both. It would make correlating audit logs and other cluster-wide events simpler. That is all. > (spoiler alert: it ain't > pretty, and there are gaps I don't believe you can cover) This > proposal provides a mechanism to do this in a way that isn't tied to > any one particular concept of a container and is manageable inside the > kernel. I like the proposal for the most part, we are just discussing the nature of the identifier, which is a minor detail in the end. > If you have a need to track audit events for containers, I find it > extremely hard to believe that you are not at least partially pleased > by the solutions presented here. It may not be everything on your > wishlist, but when did you ever get *everything* on your wishlist? It is true, and I am sorry if I came across as demanding or abrasive. It was not my intention. Of course a u64 that has to be mapped is still better than nothing. It does cause a lot more work in user space, but it is not impossible to deal with.
> > > But to be clear Richard, we've talked about this a few times, it's not > > > a "minor convenience" on our part, it's a pretty big convenience once > > > we starting having to route audit events and make decisions based on > > > the audit container ID information. Audit performance is less than > > > awesome now, I'm working hard to not make it worse. > > > > Sounds like a security vs performance trade off to me. > > Welcome to software development. It's generally a pretty terrible > hobby and/or occupation, but we make up for it with long hours and > endless frustration. Tell me more about that, not! ;-) > > > > u64 vs u128 is easy for us to > > > > accomodate in terms of scalar comparisons. It doubles the information > > > > in every container id field we print in audit records. > > > > > > ... and slows down audit container ID checks. > > > > Are you saying a cmp on a u128 is slower than a comparison on a u64 and > > this is something that will be noticeable ? > > Do you have a 128 bit system? no, but all 64bit systems have an instruction that allow you to do atomic 128 compare and swap (IIRC ?). > I don't. I've got a bunch of 64 bit > systems, and a couple of 32 bit systems too. People that use audit > have a tendency to really hammer on it, to the point that we get > performance complaints on a not infrequent basis. I don't know the > exact number of times we are going to need to check the audit > container ID, but it's reasonable to think that we'll expose it as a > filter-able field which adds a few checks, we'll use it for record > routing so that's a few more, and if we're running multiple audit > daemons we will probably want to include LSM checks which could result > in a few more audit container ID checks. 
If it was one comparison I > wouldn't be too worried about it, but the point I'm trying to make is > that we don't know what the implementation is going to look like yet > and I suspect this ID is going to be leveraged in several places in > the audit subsystem and I would much rather start small to save > headaches later. > > We can always expand the ID to a larger integer at a later date, but > we can't make it smaller. Well looking through the history of in kernel identifiers I know it is hard also to increase size, because userspace will end up depending on a specific size ... and this is the only reason I am really debating this. If it were really easy to change I wouldn't bother to do it now. > > > > A c36 is a bigger step. > > > > > > Yeah, we're not doing that, no way. > > > > Ok, I can see your point though I do not agree with it. > > > > I can see why you do not want to have arbitrary length strings, but a > > u128 sounded like a reasonable compromise to me as it has enough room > > to be able to have unique cluster-wide IDs which a u64 definitely makes > > a lot harder to provide w/o tight coordination. > > I originally wanted it to be a 32-bit integer, but Richard managed to > talk me into 64-bits, that was my compromise :) > > As I said earlier, if you are doing container auditing you're going to > need coordination with the orchestrator, regardless of the audit > container ID size. Ok, I guess that's as good as I can get it for now, thank you for your patient explanations. Simo. -- Simo Sorce Sr. Principal Software Engineer Red Hat, Inc
From mszeredi at redhat.com Mon Feb 12 15:57:31 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Mon, 12 Feb 2018 16:57:31 +0100 Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > From: Seth Forshee > > In order to support mounts from namespaces other than > init_user_ns, fuse must translate uids and gids to/from the > userns of the process servicing requests on /dev/fuse. This > patch does that, with a couple of restrictions on the namespace: > > - The userns for the fuse connection is fixed to the namespace > from which /dev/fuse is opened. > > - The namespace must be the same as s_user_ns. > > These restrictions simplify the implementation by avoiding the > need to pass around userns references and by allowing fuse to > rely on the checks in inode_change_ok for ownership changes. > Either restriction could be relaxed in the future if needed. Can we not introduce potential userspace interface regressions? The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse: allow server to run in different pid_ns") will probably bite us here as well. We basically need two modes of operation: a) old, backward compatible (not introducing any new failure mores), created with privileged mount b) new, non-backward compatible, created with unprivileged mount Technically there would still be a risk from breaking userspace, since we are using the same entry point for both, but let's hope that no practical problems come from that. > For cuse the namespace used for the connection is also simply > current_user_ns() at the time /dev/cuse is opened.
> > Patch v4 is available: https://patchwork.kernel.org/patch/8944661/ > > Cc: linux-fsdevel at vger.kernel.org > Cc: linux-kernel at vger.kernel.org > Cc: Miklos Szeredi > Signed-off-by: Seth Forshee > Signed-off-by: Dongsu Park > --- > fs/fuse/cuse.c | 3 ++- > fs/fuse/dev.c | 11 ++++++++--- > fs/fuse/dir.c | 14 +++++++------- > fs/fuse/fuse_i.h | 6 +++++- > fs/fuse/inode.c | 31 +++++++++++++++++++------------ > 5 files changed, 41 insertions(+), 24 deletions(-) > > diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c > index e9e97803..b1b83259 100644 > --- a/fs/fuse/cuse.c > +++ b/fs/fuse/cuse.c > @@ -48,6 +48,7 @@ > #include > #include > #include > +#include > > #include "fuse_i.h" > > @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file) > if (!cc) > return -ENOMEM; > > - fuse_conn_init(&cc->fc); > + fuse_conn_init(&cc->fc, current_user_ns()); > > fud = fuse_dev_alloc(&cc->fc); > if (!fud) { > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index 17f0d05b..0f780e16 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req) > > static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) > { > - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); > - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); > + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); > + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); > req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); > } > > @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, > __set_bit(FR_WAITING, &req->flags); > if (for_background) > __set_bit(FR_BACKGROUND, &req->flags); > + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) { > + fuse_put_request(fc, req); > + return ERR_PTR(-EOVERFLOW); > + } > > return req; > > @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file 
*file, > in = &req->in; > reqsize = in->h.len; > > - if (task_active_pid_ns(current) != fc->pid_ns) { > + if (task_active_pid_ns(current) != fc->pid_ns || > + current_user_ns() != fc->user_ns) { I don't get it. Why recalculate the pid if the user_ns does not match? > rcu_read_lock(); > in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); > rcu_read_unlock(); > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 24967382..ad1cfac1 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, > stat->ino = attr->ino; > stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); > stat->nlink = attr->nlink; > - stat->uid = make_kuid(&init_user_ns, attr->uid); > - stat->gid = make_kgid(&init_user_ns, attr->gid); > + stat->uid = make_kuid(fc->user_ns, attr->uid); > + stat->gid = make_kgid(fc->user_ns, attr->gid); > stat->rdev = inode->i_rdev; > stat->atime.tv_sec = attr->atime; > stat->atime.tv_nsec = attr->atimensec; > @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) > return true; > } > > -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, > - bool trust_local_cmtime) > +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, > + struct fuse_setattr_in *arg, bool trust_local_cmtime) > { > unsigned ivalid = iattr->ia_valid; > > if (ivalid & ATTR_MODE) > arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; > if (ivalid & ATTR_UID) > - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid); > + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid); > if (ivalid & ATTR_GID) > - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid); > + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid); > if (ivalid & ATTR_SIZE) > arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size; > if (ivalid & ATTR_ATIME) { > @@ -1646,7 +1646,7 @@ 
int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, > > memset(&inarg, 0, sizeof(inarg)); > memset(&outarg, 0, sizeof(outarg)); > - iattr_to_fattr(attr, &inarg, trust_local_cmtime); > + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime); > if (file) { > struct fuse_file *ff = file->private_data; > inarg.valid |= FATTR_FH; > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index d5773ca6..364e65c8 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -26,6 +26,7 @@ > #include > #include > #include > +#include > > /** Max number of pages that can be used in a single read request */ > #define FUSE_MAX_PAGES_PER_REQ 32 > @@ -466,6 +467,9 @@ struct fuse_conn { > /** The pid namespace for this mount */ > struct pid_namespace *pid_ns; > > + /** The user namespace for this mount */ > + struct user_namespace *user_ns; > + > /** Maximum read size */ > unsigned max_read; > > @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc); > /** > * Initialize fuse_conn > */ > -void fuse_conn_init(struct fuse_conn *fc); > +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns); > > /** > * Release reference to fuse_conn > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 2f504d61..7f6b2e55 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, > inode->i_ino = fuse_squash_ino(attr->ino); > inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); > set_nlink(inode, attr->nlink); > - inode->i_uid = make_kuid(&init_user_ns, attr->uid); > - inode->i_gid = make_kgid(&init_user_ns, attr->gid); > + inode->i_uid = make_kuid(fc->user_ns, attr->uid); > + inode->i_gid = make_kgid(fc->user_ns, attr->gid); > inode->i_blocks = attr->blocks; > inode->i_atime.tv_sec = attr->atime; > inode->i_atime.tv_nsec = attr->atimensec; > @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res) > return err; > } > > 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) > +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev, > + struct user_namespace *user_ns) > { > char *p; > memset(d, 0, sizeof(struct fuse_mount_data)); > @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) > case OPT_USER_ID: > if (fuse_match_uint(&args[0], &uv)) > return 0; > - d->user_id = make_kuid(current_user_ns(), uv); > + d->user_id = make_kuid(user_ns, uv); > if (!uid_valid(d->user_id)) > return 0; > d->user_id_present = 1; > @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) > case OPT_GROUP_ID: > if (fuse_match_uint(&args[0], &uv)) > return 0; > - d->group_id = make_kgid(current_user_ns(), uv); > + d->group_id = make_kgid(user_ns, uv); > if (!gid_valid(d->group_id)) > return 0; > d->group_id_present = 1; > @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) > struct super_block *sb = root->d_sb; > struct fuse_conn *fc = get_fuse_conn_super(sb); > > - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id)); > - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id)); > + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id)); > + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id)); > if (fc->default_permissions) > seq_puts(m, ",default_permissions"); > if (fc->allow_other) > @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq) > fpq->connected = 1; > } > > -void fuse_conn_init(struct fuse_conn *fc) > +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns) > { > memset(fc, 0, sizeof(*fc)); > spin_lock_init(&fc->lock); > @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc) > fc->attr_version = 1; > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); > fc->pid_ns = 
get_pid_ns(task_active_pid_ns(current)); > + fc->user_ns = get_user_ns(user_ns); > } > EXPORT_SYMBOL_GPL(fuse_conn_init); > > @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc) > if (fc->destroy_req) > fuse_request_free(fc->destroy_req); > put_pid_ns(fc->pid_ns); > + put_user_ns(fc->user_ns); > fc->release(fc); > } > } > @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) > > sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION); > > - if (!parse_fuse_opt(data, &d, is_bdev)) > + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns)) > goto err; > > if (is_bdev) { > @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) > if (!file) > goto err; > > - if ((file->f_op != &fuse_dev_operations) || > - (file->f_cred->user_ns != &init_user_ns)) > + /* > + * Require mount to happen from the same user namespace which > + * opened /dev/fuse to prevent potential attacks. > + */ > + if (file->f_op != &fuse_dev_operations || > + file->f_cred->user_ns != sb->s_user_ns) > goto err_fput; > > fc = kmalloc(sizeof(*fc), GFP_KERNEL); > @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) > if (!fc) > goto err_fput; > > - fuse_conn_init(fc); > + fuse_conn_init(fc, sb->s_user_ns); > fc->release = fuse_free_conn; > > fud = fuse_dev_alloc(fc); > -- > 2.13.6 > From ebiederm at xmission.com Mon Feb 12 16:35:10 2018 From: ebiederm at xmission.com (Eric W. 
Biederman) Date: Mon, 12 Feb 2018 10:35:10 -0600 Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: (Miklos Szeredi's message of "Mon, 12 Feb 2018 16:57:31 +0100") References: Message-ID: <87lgfy5fpd.fsf@xmission.com> Miklos Szeredi writes: > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >> From: Seth Forshee >> >> In order to support mounts from namespaces other than >> init_user_ns, fuse must translate uids and gids to/from the >> userns of the process servicing requests on /dev/fuse. This >> patch does that, with a couple of restrictions on the namespace: >> >> - The userns for the fuse connection is fixed to the namespace >> from which /dev/fuse is opened. >> >> - The namespace must be the same as s_user_ns. >> >> These restrictions simplify the implementation by avoiding the >> need to pass around userns references and by allowing fuse to >> rely on the checks in inode_change_ok for ownership changes. >> Either restriction could be relaxed in the future if needed. > > Can we not introduce potential userspace interface regressions? > > The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse: > allow server to run in different pid_ns") will probably bite us here > as well. Maybe, but unlike the pid namespace no one has been able to mount fuse outside of init_user_ns so we are much less exposed. I agree we should be careful. > We basically need two modes of operation: > > a) old, backward compatible (not introducing any new failure mores), > created with privileged mount > b) new, non-backward compatible, created with unprivileged mount > > Technically there would still be a risk from breaking userspace, since > we are using the same entry point for both, but let's hope that no > practical problems come from that. Answering from a 10,000 foot perspective: There are two cases. Requests to read/write the filesystem from outside of s_user_ns. 
These run no risk of breaking userspace as this mode has not been
implemented before.

Restrictions at mount time to ensure we are not dealing with a crazy mix
of namespaces. This has a small chance of breaking someone's crazy
setup.

Dropping requests to read/write the filesystem when the requester does
not map into s_user_ns should not be a problem to enable universally. If
s_user_ns is init_user_ns everything maps so there is no restriction.

What we can do, if we want to ensure maximum backwards compatibility, is
this: if the fuse filesystem is mounted in init_user_ns but the device
for the communication channel is opened in some other user namespace, we
can just force the communication channel to operate in init_user_ns.

That will be 100% backwards compatible in all cases and, as far as I can
see, remove the need for having different ``modes'' of operation.

This does look like the time to give all of this a hard look and see if
we can get these patches in shape to be merged.

Eric

>> For cuse the namespace used for the connection is also simply
>> current_user_ns() at the time /dev/cuse is opened.
>> >> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/ >> >> Cc: linux-fsdevel at vger.kernel.org >> Cc: linux-kernel at vger.kernel.org >> Cc: Miklos Szeredi >> Signed-off-by: Seth Forshee >> Signed-off-by: Dongsu Park >> --- >> fs/fuse/cuse.c | 3 ++- >> fs/fuse/dev.c | 11 ++++++++--- >> fs/fuse/dir.c | 14 +++++++------- >> fs/fuse/fuse_i.h | 6 +++++- >> fs/fuse/inode.c | 31 +++++++++++++++++++------------ >> 5 files changed, 41 insertions(+), 24 deletions(-) >> >> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c >> index e9e97803..b1b83259 100644 >> --- a/fs/fuse/cuse.c >> +++ b/fs/fuse/cuse.c >> @@ -48,6 +48,7 @@ >> #include >> #include >> #include >> +#include >> >> #include "fuse_i.h" >> >> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file) >> if (!cc) >> return -ENOMEM; >> >> - fuse_conn_init(&cc->fc); >> + fuse_conn_init(&cc->fc, current_user_ns()); >> >> fud = fuse_dev_alloc(&cc->fc); >> if (!fud) { >> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c >> index 17f0d05b..0f780e16 100644 >> --- a/fs/fuse/dev.c >> +++ b/fs/fuse/dev.c >> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req) >> >> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) >> { >> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); >> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); >> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); >> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); >> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); >> } >> >> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, >> __set_bit(FR_WAITING, &req->flags); >> if (for_background) >> __set_bit(FR_BACKGROUND, &req->flags); >> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) { >> + fuse_put_request(fc, req); >> + return ERR_PTR(-EOVERFLOW); >> + } >> >> return req; >> >> @@ -1260,7 +1264,8 @@ static 
ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, >> in = &req->in; >> reqsize = in->h.len; >> >> - if (task_active_pid_ns(current) != fc->pid_ns) { >> + if (task_active_pid_ns(current) != fc->pid_ns || >> + current_user_ns() != fc->user_ns) { > > I don't get it. Why recalculate the pid if the user_ns does not match? > >> rcu_read_lock(); >> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); >> rcu_read_unlock(); >> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c >> index 24967382..ad1cfac1 100644 >> --- a/fs/fuse/dir.c >> +++ b/fs/fuse/dir.c >> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, >> stat->ino = attr->ino; >> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); >> stat->nlink = attr->nlink; >> - stat->uid = make_kuid(&init_user_ns, attr->uid); >> - stat->gid = make_kgid(&init_user_ns, attr->gid); >> + stat->uid = make_kuid(fc->user_ns, attr->uid); >> + stat->gid = make_kgid(fc->user_ns, attr->gid); >> stat->rdev = inode->i_rdev; >> stat->atime.tv_sec = attr->atime; >> stat->atime.tv_nsec = attr->atimensec; >> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) >> return true; >> } >> >> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, >> - bool trust_local_cmtime) >> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, >> + struct fuse_setattr_in *arg, bool trust_local_cmtime) >> { >> unsigned ivalid = iattr->ia_valid; >> >> if (ivalid & ATTR_MODE) >> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; >> if (ivalid & ATTR_UID) >> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid); >> + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid); >> if (ivalid & ATTR_GID) >> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid); >> + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid); >> if (ivalid & ATTR_SIZE) >> 
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size; >> if (ivalid & ATTR_ATIME) { >> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, >> >> memset(&inarg, 0, sizeof(inarg)); >> memset(&outarg, 0, sizeof(outarg)); >> - iattr_to_fattr(attr, &inarg, trust_local_cmtime); >> + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime); >> if (file) { >> struct fuse_file *ff = file->private_data; >> inarg.valid |= FATTR_FH; >> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h >> index d5773ca6..364e65c8 100644 >> --- a/fs/fuse/fuse_i.h >> +++ b/fs/fuse/fuse_i.h >> @@ -26,6 +26,7 @@ >> #include >> #include >> #include >> +#include >> >> /** Max number of pages that can be used in a single read request */ >> #define FUSE_MAX_PAGES_PER_REQ 32 >> @@ -466,6 +467,9 @@ struct fuse_conn { >> /** The pid namespace for this mount */ >> struct pid_namespace *pid_ns; >> >> + /** The user namespace for this mount */ >> + struct user_namespace *user_ns; >> + >> /** Maximum read size */ >> unsigned max_read; >> >> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc); >> /** >> * Initialize fuse_conn >> */ >> -void fuse_conn_init(struct fuse_conn *fc); >> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns); >> >> /** >> * Release reference to fuse_conn >> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >> index 2f504d61..7f6b2e55 100644 >> --- a/fs/fuse/inode.c >> +++ b/fs/fuse/inode.c >> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, >> inode->i_ino = fuse_squash_ino(attr->ino); >> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); >> set_nlink(inode, attr->nlink); >> - inode->i_uid = make_kuid(&init_user_ns, attr->uid); >> - inode->i_gid = make_kgid(&init_user_ns, attr->gid); >> + inode->i_uid = make_kuid(fc->user_ns, attr->uid); >> + inode->i_gid = make_kgid(fc->user_ns, attr->gid); >> inode->i_blocks = attr->blocks; >> inode->i_atime.tv_sec = 
attr->atime; >> inode->i_atime.tv_nsec = attr->atimensec; >> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res) >> return err; >> } >> >> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) >> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev, >> + struct user_namespace *user_ns) >> { >> char *p; >> memset(d, 0, sizeof(struct fuse_mount_data)); >> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) >> case OPT_USER_ID: >> if (fuse_match_uint(&args[0], &uv)) >> return 0; >> - d->user_id = make_kuid(current_user_ns(), uv); >> + d->user_id = make_kuid(user_ns, uv); >> if (!uid_valid(d->user_id)) >> return 0; >> d->user_id_present = 1; >> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) >> case OPT_GROUP_ID: >> if (fuse_match_uint(&args[0], &uv)) >> return 0; >> - d->group_id = make_kgid(current_user_ns(), uv); >> + d->group_id = make_kgid(user_ns, uv); >> if (!gid_valid(d->group_id)) >> return 0; >> d->group_id_present = 1; >> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) >> struct super_block *sb = root->d_sb; >> struct fuse_conn *fc = get_fuse_conn_super(sb); >> >> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id)); >> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id)); >> + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id)); >> + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id)); >> if (fc->default_permissions) >> seq_puts(m, ",default_permissions"); >> if (fc->allow_other) >> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq) >> fpq->connected = 1; >> } >> >> -void fuse_conn_init(struct fuse_conn *fc) >> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns) >> { >> memset(fc, 0, sizeof(*fc)); >> 
spin_lock_init(&fc->lock); >> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc) >> fc->attr_version = 1; >> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); >> fc->pid_ns = get_pid_ns(task_active_pid_ns(current)); >> + fc->user_ns = get_user_ns(user_ns); >> } >> EXPORT_SYMBOL_GPL(fuse_conn_init); >> >> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc) >> if (fc->destroy_req) >> fuse_request_free(fc->destroy_req); >> put_pid_ns(fc->pid_ns); >> + put_user_ns(fc->user_ns); >> fc->release(fc); >> } >> } >> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) >> >> sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION); >> >> - if (!parse_fuse_opt(data, &d, is_bdev)) >> + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns)) >> goto err; >> >> if (is_bdev) { >> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) >> if (!file) >> goto err; >> >> - if ((file->f_op != &fuse_dev_operations) || >> - (file->f_cred->user_ns != &init_user_ns)) >> + /* >> + * Require mount to happen from the same user namespace which >> + * opened /dev/fuse to prevent potential attacks. >> + */ >> + if (file->f_op != &fuse_dev_operations || >> + file->f_cred->user_ns != sb->s_user_ns) >> goto err_fput; >> >> fc = kmalloc(sizeof(*fc), GFP_KERNEL); >> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) >> if (!fc) >> goto err_fput; >> >> - fuse_conn_init(fc); >> + fuse_conn_init(fc, sb->s_user_ns); >> fc->release = fuse_free_conn; >> >> fud = fuse_dev_alloc(fc); >> -- >> 2.13.6 >> From mszeredi at redhat.com Tue Feb 13 10:20:07 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Tue, 13 Feb 2018 11:20:07 +0100 Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: <87lgfy5fpd.fsf@xmission.com> References: <87lgfy5fpd.fsf@xmission.com> Message-ID: On Mon, Feb 12, 2018 at 5:35 PM, Eric W. 
Biederman wrote: > Miklos Szeredi writes: > >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >>> From: Seth Forshee >>> >>> In order to support mounts from namespaces other than >>> init_user_ns, fuse must translate uids and gids to/from the >>> userns of the process servicing requests on /dev/fuse. This >>> patch does that, with a couple of restrictions on the namespace: >>> >>> - The userns for the fuse connection is fixed to the namespace >>> from which /dev/fuse is opened. >>> >>> - The namespace must be the same as s_user_ns. >>> >>> These restrictions simplify the implementation by avoiding the >>> need to pass around userns references and by allowing fuse to >>> rely on the checks in inode_change_ok for ownership changes. >>> Either restriction could be relaxed in the future if needed. >> >> Can we not introduce potential userspace interface regressions? >> >> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse: >> allow server to run in different pid_ns") will probably bite us here >> as well. > > Maybe, but unlike the pid namespace no one has been able to mount > fuse outside of init_user_ns so we are much less exposed. I agree we > should be careful. Have to wrap my head around all the rules here. There's the may_mount() one: ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) Um, first of all, why isn't it checking current->cred->user_ns? Ah, there it is in sget(): ns_capable(user_ns, CAP_SYS_ADMIN) I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs doesn't have FS_USERNS_MOUNT. This is the one that prevents fuse mounts from being created when (current->cred->user_ns != &init_user_ns). Maybe there's a logic to this web of namespaces, but I don't yet see it. Is it documented somewhere? 
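The checks listed above (may_mount(), sget(), and the plain capable() fallback) all reduce to the same question: does the caller hold a capability relative to a particular user namespace? The following is a userspace sketch, not kernel code, modelled on the walk that cap_capable() in security/commoncap.c performs. Every model_* name and the fixtures are hypothetical and exist only for illustration; the capability number 21 is CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CAP_SYS_ADMIN 21

/* Hypothetical stand-in for struct user_namespace. */
struct model_ns {
	const struct model_ns *parent; /* NULL for the initial namespace */
	int level;                     /* 0 for the initial namespace */
	uint32_t owner;                /* kernel-side uid that created it */
};

/* Hypothetical stand-in for the relevant fields of struct cred. */
struct model_cred {
	const struct model_ns *user_ns; /* namespace the task lives in */
	uint32_t euid;                  /* kernel-side effective uid */
	uint64_t cap_effective;         /* bit n set = capability n raised */
};

/*
 * Walk from the target namespace towards the root:
 *  - in the caller's own namespace, the effective set decides;
 *  - a target at the same or a shallower level than the caller's
 *    namespace (and not that namespace itself) is never granted;
 *  - the owner of a namespace, sitting in its parent, holds all caps in it.
 */
bool model_ns_capable(const struct model_cred *cred,
		      const struct model_ns *target, int cap)
{
	for (const struct model_ns *ns = target; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return (cred->cap_effective >> cap) & 1;
		if (ns->level <= cred->user_ns->level)
			return false;
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
	return false;
}

/* Fixtures: the initial namespace and a child created by uid 1000. */
static const struct model_ns model_init_ns = { NULL, 0, 0 };
static const struct model_ns model_child_ns = { &model_init_ns, 1, 1000 };

/* "root" inside the child namespace, with a full effective set there. */
static const struct model_cred model_userns_root = {
	&model_child_ns, 1000, ~(uint64_t)0
};
/* uid 1000 in the initial namespace, with no capabilities at all. */
static const struct model_cred model_plain_user = {
	&model_init_ns, 1000, 0
};
```

In this model, model_userns_root passes a check against model_child_ns but fails one against model_init_ns. That matches the observation above that a plain capable(CAP_SYS_ADMIN) check keeps mounts out of non-init namespaces, while a check against the namespace passed to sget() does not.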
>> We basically need two modes of operation: >> >> a) old, backward compatible (not introducing any new failure mores), >> created with privileged mount >> b) new, non-backward compatible, created with unprivileged mount >> >> Technically there would still be a risk from breaking userspace, since >> we are using the same entry point for both, but let's hope that no >> practical problems come from that. > > Answering from a 10,000 foot perspective: > > There are two cases. Requests to read/write the filesystem from outside > of s_user_ns. These run no risk of breaking userspace as this mode has > not been implemented before. This comes from the fact that (s_user_ns == &init_user_ns) and all user namespaces are "inside" init_user_ns, right? One question: why does current code use the from_[ug]id_munged() variant, when the conversion can never fail. Or can it? > Restrictions at mount time to ensure we are not dealing with a crazy mix > of namespaces. This has a small chance of breaking someone's crazy > setup. > > > Dropping requests to read/write the filesystem when the requester does > not map into s_user_ns should not be a problem to enable universally. If > s_user_ns is init_user_ns everything maps so there is no restriction. > > > > What we can do if we want to ensure maximum backwards compatibility > is if the fuse filesystem is mounted in init_user_ns but if device for > the communication channel is opened in some other user namespace we > can just force the communication channel to operate in init_user_ns. > > That will be 100% backwards compatible in all cases and as far as I can > see remove the need for having different ``modes'' of operation. Okay. 
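The difference between from_kuid() and from_kuid_munged() discussed above can be pictured with a small userspace model of a uid map. This is only a sketch under stated assumptions: the model_* names and both example maps are hypothetical, the extent layout copies the three columns of /proc/<pid>/uid_map, and 65534 is the usual default overflow uid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID  ((uint32_t)-1) /* mirrors (uid_t)-1: no mapping */
#define MODEL_OVERFLOW_ID 65534u         /* usual default overflow uid */

/* One extent, same columns as a line of /proc/<pid>/uid_map. */
struct model_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first id as the kernel sees it */
	uint32_t count;       /* number of consecutive ids covered */
};

struct model_map {
	const struct model_extent *extents;
	size_t nr_extents;
};

/* Kernel-side id -> id inside the namespace, like from_kuid(). */
uint32_t model_from_kuid(const struct model_map *map, uint32_t kuid)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct model_extent *e = &map->extents[i];

		if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
			return e->first + (kuid - e->lower_first);
	}
	return MODEL_INVALID_ID;
}

/* Same lookup, but unmapped ids become the overflow id instead of -1. */
uint32_t model_from_kuid_munged(const struct model_map *map, uint32_t kuid)
{
	uint32_t uid = model_from_kuid(map, kuid);

	return uid == MODEL_INVALID_ID ? MODEL_OVERFLOW_ID : uid;
}

/* Id inside the namespace -> kernel-side id, like make_kuid(). */
uint32_t model_make_kuid(const struct model_map *map, uint32_t uid)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct model_extent *e = &map->extents[i];

		if (uid >= e->first && uid - e->first < e->count)
			return e->lower_first + (uid - e->first);
	}
	return MODEL_INVALID_ID;
}

/* Identity map standing in for init_user_ns. */
static const struct model_extent model_init_extents[] = {
	{ 0, 0, 4294967295u },
};
static const struct model_map model_init_map = { model_init_extents, 1 };

/* Hypothetical container map: ids 0..65535 backed by 100000..165535. */
static const struct model_extent model_container_extents[] = {
	{ 0, 100000, 65536 },
};
static const struct model_map model_container_map = {
	model_container_extents, 1
};
```

With the identity map every valid id translates to itself, so in this model the plain and munged lookups agree. With the container map, kernel-side uid 1000 has no translation: the plain lookup reports (uid_t)-1, which is the case the patch turns into -EOVERFLOW, while the munged lookup silently reports 65534.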
Thanks, Miklos From mszeredi at redhat.com Tue Feb 13 11:32:09 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Tue, 13 Feb 2018 12:32:09 +0100 Subject: [PATCH v5 00/11] FUSE mounts from non-init user namespaces In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > Patches 1-2 deal with an additional flag of lookup_bdev() to check for > additional inode permission. fuse_blk is less suitable for unprivileged mounting than plain fuse. fusermount doesn't allow mounting fuse_blk unprivileged, so there's little data about that usecase (IIRC ntfs3g guys did that, or at least tried to do it, but I don't remember the details). As such, I think we should leave it out of the initial version. Which means you can drop patches 1-2 from this series. Unless there's a strong use case for this. In which case we should look hard at the differences between fuse_blk and fuse and how that affects unprivileged operation. There are a few assumptions about fuse_blk filesystem being more "well behaved", I think. Thanks, Miklos From mszeredi at redhat.com Tue Feb 13 13:18:21 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Tue, 13 Feb 2018 14:18:21 +0100 Subject: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > From: Eric W. Biederman > > Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to > chown files. Ordinarily the capable_wrt_inode_uidgid check is > sufficient to allow access to files but when the underlying filesystem > has uids or gids that don't map to the current user namespace it is > not enough, so the chown permission checks need to be extended to > allow this case. 
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.

How can the filesystem be corrupted if chown is denied?

It is not clear to me what the purpose of this patch is or what exact
use case it is fixing.

Thanks,
Miklos

From mszeredi at redhat.com Tue Feb 13 13:37:12 2018
From: mszeredi at redhat.com (Miklos Szeredi)
Date: Tue, 13 Feb 2018 14:37:12 +0100
Subject: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
In-Reply-To:
References: <20171223032606.GD6837@mail.hallyn.com>
Message-ID:

On Sat, Dec 23, 2017 at 1:38 PM, Dongsu Park wrote:
> Hi,
>
> On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn wrote:
>> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>>> From: Seth Forshee
>>>
>>> Expand the check in should_remove_suid() to keep privileges for
>>
>> I realize this description came from Seth, but reading it now,
>> 'Expand' seems wrong. Expanding a check brings to my mind making
>> it stricter, not looser. How about 'Relax the check' ?
>
> Makes sense. Will do.
>
>>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>>
>>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>>
>>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>>
>> Why exactly?
>>
>> This is wrong, because capable_wrt_inode_uidgid() does a check
>> against current_user_ns, not the inode->i_sb->s_user_ns

I'm thoroughly confused. s_user_ns is supposed to be about the user
namespace the filesystem perceives itself to be in, right? How does that
come into play when checking permissions to do something?
Thanks,
Miklos

From sargun at sargun.me Tue Feb 13 15:42:46 2018
From: sargun at sargun.me (Sargun Dhillon)
Date: Tue, 13 Feb 2018 15:42:46 +0000
Subject: [PATCH net-next 0/3] eBPF Seccomp filters
Message-ID: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal>

This patchset enables seccomp filters to be written in eBPF. Although
this patchset doesn't introduce much of the functionality enabled by
eBPF, it lays the groundwork for it.

It also introduces the capability to dump eBPF filters via the PTRACE
API so that CHECKPOINT_RESTORE will be satisfied. In the attached
samples, there's an example of this. One can then use
BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
and use that at reload time.

The primary reason for not adding maps support in this patchset is to
avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. If we
have a map that the BPF program can read, it can potentially "change"
privileges after running. It seems like doing writes only is safe,
because it can be pure and side-effect free, and therefore not
negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come to an
agreement, this can be in a follow-up patchset.
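For comparison with the eBPF route proposed here, the existing cBPF path through prctl() can be sketched as below. This is an illustrative userspace example, not code from the series: build_deny_one() and install_deny_one() are hypothetical names, and the filter deliberately omits the architecture check that a production filter should perform before looking at the syscall number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DENY_ONE_LEN 4

/* Wrapper so the four-instruction program can be returned by value. */
struct deny_one {
	struct sock_filter insns[DENY_ONE_LEN];
};

/* cBPF program: fail syscall number nr with EPERM, allow the rest. */
struct deny_one build_deny_one(unsigned int nr)
{
	struct deny_one p = { {
		/* A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if (A == nr) fall through to the ERRNO return, else skip it */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	} };

	return p;
}

/* Attach the filter to the calling thread; returns 0 on success. */
int install_deny_one(unsigned int nr)
{
	struct deny_one p = build_deny_one(nr);
	struct sock_fprog fprog = {
		.len = DENY_ONE_LEN,
		.filter = p.insns,
	};

	/* Without CAP_SYS_ADMIN, no_new_privs must be set first. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog, 0, 0);
}
```

The filter here is a fixed array of instructions handed to the kernel by pointer, whereas the mode this series adds is described as taking a file descriptor for an already loaded BPF_PROG_TYPE_SECCOMP program.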
Sargun Dhillon (3): bpf, seccomp: Add eBPF filter capabilities seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters bpf: Add eBPF seccomp sample programs arch/Kconfig | 7 ++ include/linux/bpf_types.h | 3 + include/linux/seccomp.h | 12 +++ include/uapi/linux/bpf.h | 2 + include/uapi/linux/ptrace.h | 5 +- include/uapi/linux/seccomp.h | 15 ++-- kernel/bpf/syscall.c | 1 + kernel/ptrace.c | 3 + kernel/seccomp.c | 185 ++++++++++++++++++++++++++++++++++++++----- samples/bpf/Makefile | 9 +++ samples/bpf/bpf_load.c | 9 ++- samples/bpf/seccomp1_kern.c | 17 ++++ samples/bpf/seccomp1_user.c | 34 ++++++++ samples/bpf/seccomp2_kern.c | 24 ++++++ samples/bpf/seccomp2_user.c | 66 +++++++++++++++ 15 files changed, 362 insertions(+), 30 deletions(-) create mode 100644 samples/bpf/seccomp1_kern.c create mode 100644 samples/bpf/seccomp1_user.c create mode 100644 samples/bpf/seccomp2_kern.c create mode 100644 samples/bpf/seccomp2_user.c -- 2.14.1 From sargun at sargun.me Tue Feb 13 15:42:57 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 13 Feb 2018 15:42:57 +0000 Subject: [PATCH net-next 1/3] bpf, seccomp: Add eBPF filter capabilities Message-ID: <20180213154255.GA3301@ircssh-2.c.rugged-nimbus-611.internal> From: Sargun Dhillon This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant to be used for seccomp filters as an alternative to cBPF filters. The program type has relatively limited capabilities in terms of helpers, but that can be extended later on. It also introduces a new mechanism to attach these filters via the prctl and seccomp syscalls -- SECCOMP_MODE_FILTER_EXTENDED, and SECCOMP_SET_MODE_FILTER_EXTENDED respectively. 
Signed-off-by: Sargun Dhillon --- arch/Kconfig | 7 ++ include/linux/bpf_types.h | 3 + include/uapi/linux/bpf.h | 2 + include/uapi/linux/seccomp.h | 15 +++-- kernel/bpf/syscall.c | 1 + kernel/seccomp.c | 148 +++++++++++++++++++++++++++++++++++++------ 6 files changed, 150 insertions(+), 26 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 76c0b54443b1..db675888577c 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -401,6 +401,13 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_FILTER_EXTENDED + bool "Extended BPF seccomp filters" + depends on SECCOMP_FILTER && BPF_SYSCALL + help + Enables seccomp filters to be written in eBPF, as opposed + to just cBPF filters. + config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 19b8349a3809..945c65c4e461 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) #endif +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index db6bdc375126..5f96cb7ed954 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1,3 +1,4 @@ + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * @@ -133,6 +134,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_CGROUP_DEVICE, + BPF_PROG_TYPE_SECCOMP, }; enum bpf_attach_type { diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..7da8b39f2a6a 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -7,14 +7,17 @@ /* Valid values for seccomp.mode 
and prctl(PR_SET_SECCOMP, ) */ -#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ -#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ -#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ +#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ +#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ +#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ +#define SECCOMP_MODE_FILTER_EXTENDED 3 /* uses eBPF filter from fd */ /* Valid operations for seccomp syscall. */ -#define SECCOMP_SET_MODE_STRICT 0 -#define SECCOMP_SET_MODE_FILTER 1 -#define SECCOMP_GET_ACTION_AVAIL 2 +#define SECCOMP_SET_MODE_STRICT 0 +#define SECCOMP_SET_MODE_FILTER 1 +#define SECCOMP_GET_ACTION_AVAIL 2 +#define SECCOMP_SET_MODE_FILTER_EXTENDED 3 + /* Valid flags for SECCOMP_SET_MODE_FILTER */ #define SECCOMP_FILTER_FLAG_TSYNC 1 diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e24aa3241387..86d6ec8b916d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && + type != BPF_PROG_TYPE_SECCOMP && !capable(CAP_SYS_ADMIN)) return -EPERM; diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 940fa408a288..b30dd25c1cb8 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -37,6 +37,7 @@ #include #include #include +#include /** * struct seccomp_filter - container for seccomp BPF programs @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); - /* - * Installing a seccomp filter requires that the task has - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. - * This avoids scenarios where unprivileged tasks can affect the - * behavior of privileged children. 
- */ - if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) - return ERR_PTR(-EACCES); - /* Allocate a new seccomp_filter */ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); if (!sfilter) @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +/** + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd + * @user_filter: pointer to the user data containing an fd. + * + * Returns 0 on success and non-zero otherwise. + */ +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *user_fd) +{ + struct seccomp_filter *sfilter; + struct bpf_prog *fp; + int fd; + + /* Fetch the fd from userspace */ + if (get_user(fd, (int __user *)user_fd)) + return ERR_PTR(-EFAULT); + + /* Allocate a new seccomp_filter */ + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); + if (!sfilter) + return ERR_PTR(-ENOMEM); + + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); + if (IS_ERR(fp)) { + kfree(sfilter); + return ERR_CAST(fp); + } + + sfilter->prog = fp; + refcount_set(&sfilter->usage, 1); + + return sfilter; +} +#else +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *filter_fd) +{ + return ERR_PTR(-EINVAL); +} +#endif + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) static inline void seccomp_filter_free(struct seccomp_filter *filter) { if (filter) { - bpf_prog_destroy(filter->prog); + if (bpf_prog_was_classic(filter->prog)) + bpf_prog_destroy(filter->prog); + else + bpf_prog_put(filter->prog); kfree(filter); } } @@ -842,18 +877,36 @@ static long seccomp_set_mode_strict(void) * Returns 0 on success or -EINVAL on failure. 
*/ static long seccomp_set_mode_filter(unsigned int flags, - const char __user *filter) + const char __user *filter, + unsigned long filter_type) { - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; + /* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ + const unsigned long filter_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; /* Validate flags. */ if (flags & ~SECCOMP_FILTER_FLAG_MASK) return -EINVAL; + /* + * Installing a seccomp filter requires that the task has + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. + * This avoids scenarios where unprivileged tasks can affect the + * behavior of privileged children. + */ + if (!task_no_new_privs(current) && + security_capable_noaudit(current_cred(), current_user_ns(), + CAP_SYS_ADMIN) != 0) + return -EACCES; /* Prepare the new filter before holding any locks. */ - prepared = seccomp_prepare_user_filter(filter); + if (filter_type == SECCOMP_SET_MODE_FILTER_EXTENDED) + prepared = seccomp_prepare_extended_filter(filter); + else if (filter_type == SECCOMP_SET_MODE_FILTER) + prepared = seccomp_prepare_user_filter(filter); + else + return -EINVAL; + if (IS_ERR(prepared)) return PTR_ERR(prepared); @@ -867,7 +920,7 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_lock_irq(&current->sighand->siglock); - if (!seccomp_may_assign_mode(seccomp_mode)) + if (!seccomp_may_assign_mode(filter_mode)) goto out; ret = seccomp_attach_filter(flags, prepared); @@ -876,7 +929,7 @@ static long seccomp_set_mode_filter(unsigned int flags, /* Do not free the successfully attached filter.
*/ prepared = NULL; - seccomp_assign_mode(current, seccomp_mode); + seccomp_assign_mode(current, filter_mode); out: spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) @@ -926,7 +979,9 @@ static long do_seccomp(unsigned int op, unsigned int flags, return -EINVAL; return seccomp_set_mode_strict(); case SECCOMP_SET_MODE_FILTER: - return seccomp_set_mode_filter(flags, uargs); + return seccomp_set_mode_filter(flags, uargs, op); + case SECCOMP_SET_MODE_FILTER_EXTENDED: + return seccomp_set_mode_filter(flags, uargs, op); case SECCOMP_GET_ACTION_AVAIL: if (flags != 0) return -EINVAL; @@ -969,6 +1024,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) op = SECCOMP_SET_MODE_FILTER; uargs = filter; break; + case SECCOMP_MODE_FILTER_EXTENDED: + op = SECCOMP_SET_MODE_FILTER_EXTENDED; + uargs = filter; + break; default: return -EINVAL; } @@ -1040,8 +1099,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, if (IS_ERR(filter)) return PTR_ERR(filter); - fprog = filter->prog->orig_prog; - if (!fprog) { + if (!bpf_prog_was_classic(filter->prog)) { /* This must be a new non-cBPF filter, since we save * every cBPF filter's orig_prog above when * CONFIG_CHECKPOINT_RESTORE is enabled.
@@ -1050,6 +1108,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, goto out; } + fprog = filter->prog->orig_prog; ret = fprog->len; if (!data) goto out; @@ -1239,6 +1298,55 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, return 0; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +static bool seccomp_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (type != BPF_READ) + return false; + + if (off < 0 || off + size > sizeof(struct seccomp_data)) + return false; + + switch (off) { + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): + return (size == sizeof(__u64)); + case bpf_ctx_range(struct seccomp_data, nr): + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); + case bpf_ctx_range(struct seccomp_data, arch): + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); + case bpf_ctx_range(struct seccomp_data, instruction_pointer): + return (size == FIELD_SIZEOF(struct seccomp_data, + instruction_pointer)); + } + + return false; +} + +static const struct bpf_func_proto * +seccomp_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_get_current_uid_gid: + return &bpf_get_current_uid_gid_proto; + case BPF_FUNC_trace_printk: + if (capable(CAP_SYS_ADMIN)) + return bpf_get_trace_printk_proto(); + default: + return NULL; + } +} + +const struct bpf_prog_ops seccomp_prog_ops = { +}; + +const struct bpf_verifier_ops seccomp_verifier_ops = { + .get_func_proto = seccomp_func_proto, + .is_valid_access = seccomp_is_valid_access, +}; +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ + static struct ctl_path seccomp_sysctl_path[] = { { .procname = "kernel", }, { .procname = "seccomp", }, -- 2.14.1 From sargun at sargun.me Tue Feb 13 15:43:10 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 13 Feb 2018 15:43:10 +0000 Subject: [PATCH net-next 2/3] seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters 
Message-ID: <20180213154308.GA3310@ircssh-2.c.rugged-nimbus-611.internal> From: Sargun Dhillon This extends the ptrace API to allow fetching eBPF seccomp filters attached to programs. This is to enable checkpoint / restore cases. The user will have to use the traditional PTRACE_SECCOMP_GET_FILTER API call, and if they get an invalid medium type error they can switch over to the eBPF variant of the API -- PTRACE_SECCOMP_GET_FILTER_EXTENDED. Signed-off-by: Sargun Dhillon --- include/linux/seccomp.h | 12 ++++++++++++ include/uapi/linux/ptrace.h | 5 +++-- kernel/ptrace.c | 3 +++ kernel/seccomp.c | 37 +++++++++++++++++++++++++++++++++++++ 4 files changed, 55 insertions(+), 2 deletions(-) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index c723a5c4e3ff..97fdbcffacc2 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -110,4 +110,16 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ +#if defined(CONFIG_SECCOMP_FILTER_EXTENDED) && defined(CONFIG_CHECKPOINT_RESTORE) +extern long seccomp_get_filter_extended(struct task_struct *task, + unsigned long n, + void __user *data); +#else +static inline long seccomp_get_filter_extended(struct task_struct *task, + unsigned long n, + void __user *data) +{ + return -EINVAL; +} +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED && CONFIG_CHECKPOINT_RESTORE */ #endif /* _LINUX_SECCOMP_H */ diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index e46d82b91166..c619eb46b9d9 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -65,8 +65,9 @@ struct ptrace_peeksiginfo_args { #define PTRACE_GETSIGMASK 0x420a #define PTRACE_SETSIGMASK 0x420b -#define PTRACE_SECCOMP_GET_FILTER 0x420c -#define PTRACE_SECCOMP_GET_METADATA 0x420d +#define PTRACE_SECCOMP_GET_FILTER 0x420c +#define PTRACE_SECCOMP_GET_METADATA 0x420d +#define PTRACE_SECCOMP_GET_FILTER_EXTENDED 0x420e struct
seccomp_metadata { unsigned long filter_off; /* Input: which filter */ diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 21fec73d45d4..90c62f9e1a55 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -1096,6 +1096,9 @@ int ptrace_request(struct task_struct *child, long request, ret = seccomp_get_metadata(child, addr, datavp); break; + case PTRACE_SECCOMP_GET_FILTER_EXTENDED: + ret = seccomp_get_filter_extended(child, addr, datavp); + default: break; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b30dd25c1cb8..931a13a8cd63 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1155,6 +1155,43 @@ long seccomp_get_metadata(struct task_struct *task, } #endif +#if defined(CONFIG_SECCOMP_FILTER_EXTENDED) && defined(CONFIG_CHECKPOINT_RESTORE) +long seccomp_get_filter_extended(struct task_struct *task, + unsigned long filter_off, + void __user *data) +{ + struct seccomp_filter *filter; + struct bpf_prog *prog; + long ret; + + if (!capable(CAP_SYS_ADMIN) || + current->seccomp.mode != SECCOMP_MODE_DISABLED) { + return -EACCES; + } + + filter = get_nth_filter(task, filter_off); + if (IS_ERR(filter)) + return PTR_ERR(filter); + + if (bpf_prog_was_classic(filter->prog)) { + ret = -EMEDIUMTYPE; + goto out; + } + prog = bpf_prog_inc_not_zero(filter->prog); + if (IS_ERR(prog)) { + ret = PTR_ERR(prog); + goto out; + } + + ret = bpf_prog_new_fd(filter->prog); + if (ret < 0) + bpf_prog_put(prog); +out: + __put_seccomp_filter(filter); + return ret; +} +#endif + #ifdef CONFIG_SYSCTL /* Human readable action names for friendly sysctl interaction */ -- 2.14.1 From sargun at sargun.me Tue Feb 13 15:43:22 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 13 Feb 2018 15:43:22 +0000 Subject: [PATCH net-next 3/3] bpf: Add eBPF seccomp sample programs Message-ID: <20180213154320.GA3319@ircssh-2.c.rugged-nimbus-611.internal> From: Sargun Dhillon This adds two sample programs: seccomp1: A simple eBPF seccomp filter seccomp2: A program which installs an eBPF filter and 
then retrieves it via ptrace to show checkpoint / restore capability. Signed-off-by: Sargun Dhillon --- samples/bpf/Makefile | 9 +++++++ samples/bpf/bpf_load.c | 9 +++++-- samples/bpf/seccomp1_kern.c | 17 ++++++++++++ samples/bpf/seccomp1_user.c | 34 +++++++++++++++++++++++ samples/bpf/seccomp2_kern.c | 24 +++++++++++++++++ samples/bpf/seccomp2_user.c | 66 +++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 157 insertions(+), 2 deletions(-) create mode 100644 samples/bpf/seccomp1_kern.c create mode 100644 samples/bpf/seccomp1_user.c create mode 100644 samples/bpf/seccomp2_kern.c create mode 100644 samples/bpf/seccomp2_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ec3fc8d88e87..f1ba5fa18db7 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -43,6 +43,8 @@ hostprogs-y += xdp_redirect_cpu hostprogs-y += xdp_monitor hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp +hostprogs-y += seccomp1 +hostprogs-y += seccomp2 # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o @@ -93,6 +95,9 @@ xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o +seccomp1-objs := bpf_load.o $(LIBBPF) seccomp1_user.o +seccomp2-objs := bpf_load.o $(LIBBPF) seccomp2_user.o + # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -144,6 +149,8 @@ always += xdp_monitor_kern.o always += xdp_rxq_info_kern.o always += xdp2skb_meta_kern.o always += syscall_tp_kern.o +always += seccomp1_kern.o +always += seccomp2_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -188,6 +195,8 @@ HOSTLOADLIBES_xdp_redirect_cpu += -lelf HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_xdp_rxq_info += -lelf HOSTLOADLIBES_syscall_tp += -lelf +HOSTLOADLIBES_seccomp1 += -lelf 
+HOSTLOADLIBES_seccomp2 += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 69806d74fa53..856bc8b93916 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -67,6 +67,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0; bool is_sockops = strncmp(event, "sockops", 7) == 0; bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0; + bool is_seccomp = strncmp(event, "seccomp", 7) == 0; size_t insns_cnt = size / sizeof(struct bpf_insn); enum bpf_prog_type prog_type; char buf[256]; @@ -96,6 +97,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_type = BPF_PROG_TYPE_SOCK_OPS; } else if (is_sk_skb) { prog_type = BPF_PROG_TYPE_SK_SKB; + } else if (is_seccomp) { + prog_type = BPF_PROG_TYPE_SECCOMP; } else { printf("Unknown event '%s'\n", event); return -1; @@ -110,7 +113,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_fd[prog_cnt++] = fd; - if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk) + if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk || + is_seccomp) return 0; if (is_socket || is_sockops || is_sk_skb) { @@ -589,7 +593,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map) memcmp(shname, "socket", 6) == 0 || memcmp(shname, "cgroup/", 7) == 0 || memcmp(shname, "sockops", 7) == 0 || - memcmp(shname, "sk_skb", 6) == 0) { + memcmp(shname, "sk_skb", 6) == 0 || + memcmp(shname, "seccomp", 7) == 0) { ret = load_and_attach(shname, data->d_buf, data->d_size); if (ret != 0) diff --git a/samples/bpf/seccomp1_kern.c b/samples/bpf/seccomp1_kern.c new file mode 100644 index 000000000000..7fcbd48fa69a --- /dev/null +++ b/samples/bpf/seccomp1_kern.c @@ -0,0 +1,17 @@ +#include 
+#include
+#include
+#include "bpf_helpers.h"
+#include
+
+/* Returns EPERM when trying to close fd 999 */
+SEC("seccomp")
+int bpf_prog1(struct seccomp_data *ctx)
+{
+	if (ctx->nr == __NR_close && ctx->args[0] == 999)
+		return SECCOMP_RET_ERRNO | EPERM;
+
+	return SECCOMP_RET_ALLOW;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/seccomp1_user.c b/samples/bpf/seccomp1_user.c
new file mode 100644
index 000000000000..35b3533de711
--- /dev/null
+++ b/samples/bpf/seccomp1_user.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include "libbpf.h"
+#include "bpf_load.h"
+#include
+#include
+#include
+#include
+#include
+#include
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	assert(!prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER_EXTENDED, &prog_fd));
+	close(111);
+	assert(errno == EBADF);
+	close(999);
+	assert(errno == EPERM);
+
+	return 0;
+}
diff --git a/samples/bpf/seccomp2_kern.c b/samples/bpf/seccomp2_kern.c
new file mode 100644
index 000000000000..38014ed41b9b
--- /dev/null
+++ b/samples/bpf/seccomp2_kern.c
@@ -0,0 +1,24 @@
+#include
+#include
+#include
+#include "bpf_helpers.h"
+#include
+
+static inline int unknown(struct seccomp_data *ctx)
+{
+	if (ctx->args[0] % 2 == 0)
+		return SECCOMP_RET_KILL;
+	return SECCOMP_RET_LOG;
+}
+
+/* Returns errno on sched_yield syscall */
+SEC("seccomp")
+int bpf_prog1(struct seccomp_data *ctx)
+{
+	if (ctx->nr == __NR_sched_yield)
+		return SECCOMP_RET_ERRNO | EPERM;
+
+	return SECCOMP_RET_ALLOW;
+}
+
+char _license[] SEC("license") = "aGPL";
diff --git a/samples/bpf/seccomp2_user.c b/samples/bpf/seccomp2_user.c
new file mode 100644
index 000000000000..986f70473fca
--- /dev/null
+++ b/samples/bpf/seccomp2_user.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include "libbpf.h" +#include "bpf_load.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define PTRACE_SECCOMP_GET_FILTER_EXTENDED 0x420e +static void tracee(void) +{ + assert(!prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)); + + assert(!prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER_EXTENDED, &prog_fd)); + sched_yield(); + assert(errno == EPERM); + ptrace(PTRACE_TRACEME, 0, NULL, NULL); + kill(getpid(), SIGSTOP); +} + +int main(int argc, char **argv) +{ + struct bpf_prog_info loaded_prog_info = {}, retrieved_prog_info = {}; + char filename[256]; + __u32 info_len; + pid_t child; + int fd; + + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(filename)) { + printf("%s", bpf_log_buf); + return 1; + } + info_len = sizeof(loaded_prog_info); + assert(!bpf_obj_get_info_by_fd(prog_fd[0], &loaded_prog_info, + &info_len)); + + child = fork(); + if (child == 0) { + tracee(); + return 0; + } + + wait(NULL); + /* Fetches eBPF filter from traced child */ + fd = ptrace(PTRACE_SECCOMP_GET_FILTER_EXTENDED, child, 0, NULL); + kill(child, SIGKILL); + assert(fd >= 0); + info_len = sizeof(retrieved_prog_info); + assert(!bpf_obj_get_info_by_fd(fd, &retrieved_prog_info, &info_len)); + assert(retrieved_prog_info.id == loaded_prog_info.id); + + return 0; +} -- 2.14.1 From keescook at chromium.org Tue Feb 13 15:47:33 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 07:47:33 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: > This patchset enables seccomp filters to be written in eBPF. Although, > this patchset doesn't introduce much of the functionality enabled by > eBPF, it lays the ground work for it. 
> > It also introduces the capability to dump eBPF filters via the PTRACE > API in order to make it so that CHECKPOINT_RESTORE will be satisifed. > In the attached samples, there's an example of this. One can then use > BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, > and use that at reload time. > > The primary reason for not adding maps support in this patchset is > to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. > If we have a map that the BPF program can read, it can potentially > "change" privileges after running. It seems like doing writes only > is safe, because it can be pure, and side effect free, and therefore > not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come > to an agreement, this can be in a follow-up patchset. What's the reason for adding eBPF support? seccomp shouldn't need it, and it only makes the code more complex. I'd rather stick with cBPF until we have an overwhelmingly good reason to use eBPF as a "native" seccomp filter language. 
-Kees > > > Sargun Dhillon (3): > bpf, seccomp: Add eBPF filter capabilities > seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp > filters > bpf: Add eBPF seccomp sample programs > > arch/Kconfig | 7 ++ > include/linux/bpf_types.h | 3 + > include/linux/seccomp.h | 12 +++ > include/uapi/linux/bpf.h | 2 + > include/uapi/linux/ptrace.h | 5 +- > include/uapi/linux/seccomp.h | 15 ++-- > kernel/bpf/syscall.c | 1 + > kernel/ptrace.c | 3 + > kernel/seccomp.c | 185 ++++++++++++++++++++++++++++++++++++++----- > samples/bpf/Makefile | 9 +++ > samples/bpf/bpf_load.c | 9 ++- > samples/bpf/seccomp1_kern.c | 17 ++++ > samples/bpf/seccomp1_user.c | 34 ++++++++ > samples/bpf/seccomp2_kern.c | 24 ++++++ > samples/bpf/seccomp2_user.c | 66 +++++++++++++++ > 15 files changed, 362 insertions(+), 30 deletions(-) > create mode 100644 samples/bpf/seccomp1_kern.c > create mode 100644 samples/bpf/seccomp1_user.c > create mode 100644 samples/bpf/seccomp2_kern.c > create mode 100644 samples/bpf/seccomp2_user.c > > -- > 2.14.1 > -- Kees Cook Pixel Security From sargun at sargun.me Tue Feb 13 16:29:26 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 13 Feb 2018 08:29:26 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook wrote: > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >> This patchset enables seccomp filters to be written in eBPF. Although, >> this patchset doesn't introduce much of the functionality enabled by >> eBPF, it lays the ground work for it. >> >> It also introduces the capability to dump eBPF filters via the PTRACE >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >> In the attached samples, there's an example of this. One can then use >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >> and use that at reload time. 
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>
> -Kees
>
Three reasons:

1) The userspace tooling for eBPF is much better than the userspace
tooling for cBPF. Our use case is specifically to optimize Docker
policies. This is roughly what their seccomp policy looks like:
https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
It would be much nicer to be able to leverage eBPF to write this in C,
or any of the other languages targeting eBPF. In addition, if we
have write-only maps, we can exfiltrate information from seccomp, like
arguments and errors, in a relatively cheap way compared to cBPF, and
then extract this via the bcc stack. Writing cBPF via C macros is a
pain, and the off-the-shelf cBPF libraries are getting no love. The
eBPF community is *exploding* with contributions.

2) In my testing, which so far has been very rudimentary, rewriting
the policy that libseccomp generates from the Docker policy to use
eBPF and eBPF maps performs much better than cBPF. The specific case
tested was to use a BPF array to look up rules for a particular
syscall. In a super trivial test, this was about 5% lower latency
than using traditional branches. If you need more evidence of this,
I can work a little bit more on the maps-related patches, and see if
I can get some more benchmarking. From my understanding, we would
need to add "sealing" support for maps, in which they can be marked
as read-only, and only at that point should an eBPF seccomp program
be able to read from them.

3) Eventually, I'd like to use some more advanced capabilities of
eBPF, like being able to rewrite arguments safely (not things referred
to by pointers, but just plain old arguments).

>>
>>
>> Sargun Dhillon (3):
>>   bpf, seccomp: Add eBPF filter capabilities
>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>     filters
>>   bpf: Add eBPF seccomp sample programs
>>
>>  arch/Kconfig                 |   7 ++
>>  include/linux/bpf_types.h    |   3 +
>>  include/linux/seccomp.h      |  12 +++
>>  include/uapi/linux/bpf.h     |   2 +
>>  include/uapi/linux/ptrace.h  |   5 +-
>>  include/uapi/linux/seccomp.h |  15 ++--
>>  kernel/bpf/syscall.c         |   1 +
>>  kernel/ptrace.c              |   3 +
>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>  samples/bpf/Makefile         |   9 +++
>>  samples/bpf/bpf_load.c       |   9 ++-
>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>  create mode 100644 samples/bpf/seccomp1_user.c
>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>  create mode 100644 samples/bpf/seccomp2_user.c
>>
>> --
>> 2.14.1
>>
>
>
>
> --
> Kees Cook
> Pixel Security

From me at jessfraz.com Tue Feb 13 17:02:03 2018
From: me at jessfraz.com (Jessie Frazelle)
Date: Tue, 13 Feb 2018 12:02:03 -0500
Subject: [PATCH net-next 0/3] eBPF Seccomp filters
In-Reply-To:
References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal>
Message-ID:

On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon wrote:
> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook
wrote: >> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >>> This patchset enables seccomp filters to be written in eBPF. Although, >>> this patchset doesn't introduce much of the functionality enabled by >>> eBPF, it lays the ground work for it. >>> >>> It also introduces the capability to dump eBPF filters via the PTRACE >>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >>> In the attached samples, there's an example of this. One can then use >>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >>> and use that at reload time. >>> >>> The primary reason for not adding maps support in this patchset is >>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >>> If we have a map that the BPF program can read, it can potentially >>> "change" privileges after running. It seems like doing writes only >>> is safe, because it can be pure, and side effect free, and therefore >>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >>> to an agreement, this can be in a follow-up patchset. >> >> What's the reason for adding eBPF support? seccomp shouldn't need it, >> and it only makes the code more complex. I'd rather stick with -- cBPF >> until we have an overwhelmingly good reason to use eBPF as a "native" >> seccomp filter language. >> >> -Kees >> > Three reasons: > 1) The userspace tooling for eBPF is much better than the user space > tooling for cBPF. Our use case is specifically to optimize Docker > policies. This is roughly what their seccomp policy looks like: > https://github.com/moby/moby/blob/master/profiles/seccomp/default.json. > It would be much nicer to be able to leverage eBPF to write this in C, > or any other the other languages targetting eBPF. In addition, if we > have write-only maps, we can exfiltrate information from seccomp, like > arguments, and errors in a relatively cheap way compared to cBPF, and > then extract this via the bcc stack. 
Writing cBPF via C macros is a > pain, and the off the shelf cBPF libraries are getting no love. The > eBPF community is *exploding* with contributions. Is stage two of this getting runc to support eBPF and docker to change the default to be written as eBPF, because I foresee that being a problem mainly with the kernel versions people use. The point of that patch was to help the most people and as your point in (2) is made about performance, that is a trade-off I would be willing to make in order to have this functionality on more kernel versions. The other alternative would be to have docker translate to use eBPF if the kernel supported it, but that amount of complexity seems a bit unnecessary for a feature that was trying to also be "simple". Or do you plan on wrapping filters onto processes tangentially from the runtime, in which case, that should be totally fine :) Anyways this is kinda a tangent from the main point of getting it in the kernel, just I would hate to see someone having to maintain this without there being a path to getting it upstream elsewhere. > > 2) In my testing, which thus so far has been very rudimentary, with > rewriting the policy that libseccomp generates from the Docker policy > to use eBPF, and eBPF maps performs much better than cBPF. The > specific case tested was to use a bpf array to lookup rules for a > particular syscall. In a super trivial test, this was about 5% low > latency than using traditional branches. If you need more evidence of > this, I can work a little bit more on the maps related patches, and > see if I can get some more benchmarking. From my understanding, we > would need to add "sealing" support for maps, in which they can be > marked as read-only, and only at that point should an eBPF seccomp > program be able to read from them. 
> > 3) Eventually, I'd like to use some more advanced capabilities of > eBPF, like being able to rewrite arguments safely (not things referred > to by pointers, but just plain old arguments). > >>> >>> >>> Sargun Dhillon (3): >>> bpf, seccomp: Add eBPF filter capabilities >>> seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp >>> filters >>> bpf: Add eBPF seccomp sample programs >>> >>> arch/Kconfig | 7 ++ >>> include/linux/bpf_types.h | 3 + >>> include/linux/seccomp.h | 12 +++ >>> include/uapi/linux/bpf.h | 2 + >>> include/uapi/linux/ptrace.h | 5 +- >>> include/uapi/linux/seccomp.h | 15 ++-- >>> kernel/bpf/syscall.c | 1 + >>> kernel/ptrace.c | 3 + >>> kernel/seccomp.c | 185 ++++++++++++++++++++++++++++++++++++++----- >>> samples/bpf/Makefile | 9 +++ >>> samples/bpf/bpf_load.c | 9 ++- >>> samples/bpf/seccomp1_kern.c | 17 ++++ >>> samples/bpf/seccomp1_user.c | 34 ++++++++ >>> samples/bpf/seccomp2_kern.c | 24 ++++++ >>> samples/bpf/seccomp2_user.c | 66 +++++++++++++++ >>> 15 files changed, 362 insertions(+), 30 deletions(-) >>> create mode 100644 samples/bpf/seccomp1_kern.c >>> create mode 100644 samples/bpf/seccomp1_user.c >>> create mode 100644 samples/bpf/seccomp2_kern.c >>> create mode 100644 samples/bpf/seccomp2_user.c >>> >>> -- >>> 2.14.1 >>> >> >> >> >> -- >> Kees Cook >> Pixel Security -- Jessie Frazelle 4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3 pgp.mit.edu From cpuguy83 at gmail.com Tue Feb 13 17:07:08 2018 From: cpuguy83 at gmail.com (Brian Goff) Date: Tue, 13 Feb 2018 12:07:08 -0500 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: Agreed. I like the idea, but we'll have to maintain backwards compat at the docker/runc level... but doesn't mean it shouldn't be added. It may just take a long time to add support. 
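As a rough sketch of what that backwards-compat handling at the docker/runc level could look like (this is not code from runc or Docker; the names filter_installer, install_with_fallback and the two stand-in installers are hypothetical and only illustrate the control flow), a runtime could try the eBPF path first and drop back to cBPF when the kernel refuses it:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical installer callback: returns 0 on success, nonzero on
 * failure. In a real runtime these would wrap the actual seccomp
 * calls; here they are stand-ins so the fallback logic can be shown
 * on its own.
 */
typedef int (*filter_installer)(const void *policy);

enum filter_backend {
	BACKEND_NONE = 0,	/* nothing could be installed */
	BACKEND_EBPF,		/* eBPF filter accepted */
	BACKEND_CBPF,		/* fell back to a classic BPF filter */
};

/* Try the eBPF path first; fall back to cBPF if it is refused. */
static enum filter_backend install_with_fallback(filter_installer try_ebpf,
						 filter_installer try_cbpf,
						 const void *policy)
{
	assert(try_cbpf != NULL);	/* cBPF is the baseline, always required */

	if (try_ebpf != NULL && try_ebpf(policy) == 0)
		return BACKEND_EBPF;
	if (try_cbpf(policy) == 0)
		return BACKEND_CBPF;
	return BACKEND_NONE;
}

/* Stand-in installers, only for exercising the fallback logic. */
static int installer_unsupported(const void *policy)
{
	(void)policy;
	return -1;	/* pretend the kernel rejected this filter type */
}

static int installer_ok(const void *policy)
{
	(void)policy;
	return 0;	/* pretend the filter was installed */
}
```

With these stand-ins, install_with_fallback(installer_unsupported, installer_ok, NULL) reports BACKEND_CBPF, which is the behaviour an older kernel would need. Falling back on any failure, rather than only on a specific errno, is a simplification made for this sketch.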
--
- Brian Goff

From sargun at
sargun.me (Sargun Dhillon) Date: Tue, 13 Feb 2018 09:31:42 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle wrote: > On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon wrote: >> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook wrote: >>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >>>> This patchset enables seccomp filters to be written in eBPF. Although, >>>> this patchset doesn't introduce much of the functionality enabled by >>>> eBPF, it lays the ground work for it. >>>> >>>> It also introduces the capability to dump eBPF filters via the PTRACE >>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >>>> In the attached samples, there's an example of this. One can then use >>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >>>> and use that at reload time. >>>> >>>> The primary reason for not adding maps support in this patchset is >>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >>>> If we have a map that the BPF program can read, it can potentially >>>> "change" privileges after running. It seems like doing writes only >>>> is safe, because it can be pure, and side effect free, and therefore >>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >>>> to an agreement, this can be in a follow-up patchset. >>> >>> What's the reason for adding eBPF support? seccomp shouldn't need it, >>> and it only makes the code more complex. I'd rather stick with -- cBPF >>> until we have an overwhelmingly good reason to use eBPF as a "native" >>> seccomp filter language. >>> >>> -Kees >>> >> Three reasons: >> 1) The userspace tooling for eBPF is much better than the user space >> tooling for cBPF. Our use case is specifically to optimize Docker >> policies. 
This is roughly what their seccomp policy looks like: >> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json. >> It would be much nicer to be able to leverage eBPF to write this in C, >> or any other the other languages targetting eBPF. In addition, if we >> have write-only maps, we can exfiltrate information from seccomp, like >> arguments, and errors in a relatively cheap way compared to cBPF, and >> then extract this via the bcc stack. Writing cBPF via C macros is a >> pain, and the off the shelf cBPF libraries are getting no love. The >> eBPF community is *exploding* with contributions. > > Is stage two of this getting runc to support eBPF and docker to change > the default to be written as eBPF, because I foresee that being a > problem mainly with the kernel versions people use. The point of that > patch was to help the most people and as your point in (2) is made > about performance, that is a trade-off I would be willing to make in > order to have this functionality on more kernel versions. > > The other alternative would be to have docker translate to use eBPF if ).> the kernel supported it, but that amount of complexity seems a bit > unnecessary for a feature that was trying to also be "simple". > > Or do you plan on wrapping filters onto processes tangentially from > the runtime, in which case, that should be totally fine :) > > Anyways this is kinda a tangent from the main point of getting it in > the kernel, just I would hate to see someone having to maintain this > without there being a path to getting it upstream elsewhere. > We (me) intend to do the work to get it into Docker / Moby / Containerd / Runc / Whatever the kids call it these days. It already has the idea of multiple security modules, like seccomp, apparmor, etc.. I can imagine that the first approach would be just to let people pass eBPF filters as code, in the same way. 
Afterwards, there could be more sophisticated approaches in order to transparently upgrade people's filters, and give them performance upgrades. A really naive approach is to take the JSON seccomp policy document and convert it to plain old C with switch / case statements. Then we can just push that through LLVM and we're in business. Although, for some reason, I don't think the folks will want to take a hard dep on llvm at runtime, so maybe there's some mechanism where it first tries llvm, then tries to create an eBPF application naively, and then falls back to cBPF. My primary fear with the first two approaches is that given how the policies are written today, it's not conducive to the eBPF instruction limit. Our initial approach for this internally, since we use Docker 1.13.1 and backporting this can be a bit of a pain, uses Docker's ability to spawn a pid 1 in the container: we can use that to install the seccomp filter, while leaving seccomp in the daemon off. Whenever this is ready for public consumption, we'll share. Anyway, a 5% performance gain across our fleet is an exciting proposition, and we use Docker, so it's a problem that we have to figure out anyway. >> >> 2) In my testing, which thus so far has been very rudimentary, with >> rewriting the policy that libseccomp generates from the Docker policy >> to use eBPF, and eBPF maps performs much better than cBPF. The >> specific case tested was to use a bpf array to lookup rules for a >> particular syscall. In a super trivial test, this was about 5% low >> latency than using traditional branches. If you need more evidence of >> this, I can work a little bit more on the maps related patches, and >> see if I can get some more benchmarking. From my understanding, we >> would need to add "sealing" support for maps, in which they can be >> marked as read-only, and only at that point should an eBPF seccomp >> program be able to read from them.
>> >> 3) Eventually, I'd like to use some more advanced capabilities of >> eBPF, like being able to rewrite arguments safely (not things referred >> to by pointers, but just plain old arguments). >> >>>> >>>> >>>> Sargun Dhillon (3): >>>> bpf, seccomp: Add eBPF filter capabilities >>>> seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp >>>> filters >>>> bpf: Add eBPF seccomp sample programs >>>> >>>> arch/Kconfig | 7 ++ >>>> include/linux/bpf_types.h | 3 + >>>> include/linux/seccomp.h | 12 +++ >>>> include/uapi/linux/bpf.h | 2 + >>>> include/uapi/linux/ptrace.h | 5 +- >>>> include/uapi/linux/seccomp.h | 15 ++-- >>>> kernel/bpf/syscall.c | 1 + >>>> kernel/ptrace.c | 3 + >>>> kernel/seccomp.c | 185 ++++++++++++++++++++++++++++++++++++++----- >>>> samples/bpf/Makefile | 9 +++ >>>> samples/bpf/bpf_load.c | 9 ++- >>>> samples/bpf/seccomp1_kern.c | 17 ++++ >>>> samples/bpf/seccomp1_user.c | 34 ++++++++ >>>> samples/bpf/seccomp2_kern.c | 24 ++++++ >>>> samples/bpf/seccomp2_user.c | 66 +++++++++++++++ >>>> 15 files changed, 362 insertions(+), 30 deletions(-) >>>> create mode 100644 samples/bpf/seccomp1_kern.c >>>> create mode 100644 samples/bpf/seccomp1_user.c >>>> create mode 100644 samples/bpf/seccomp2_kern.c >>>> create mode 100644 samples/bpf/seccomp2_user.c >>>> >>>> -- >>>> 2.14.1 >>>> >>> >>> >>> >>> -- >>> Kees Cook >>> Pixel Security > > > > -- > > > Jessie Frazelle > 4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3 > pgp.mit.edu From keescook at chromium.org Tue Feb 13 20:16:42 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 12:16:42 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon wrote: > On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle wrote: >> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon wrote: >>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook wrote: >>>> 
What's the reason for adding eBPF support? seccomp shouldn't need it, >>>> and it only makes the code more complex. I'd rather stick with cBPF >>>> until we have an overwhelmingly good reason to use eBPF as a "native" >>>> seccomp filter language. >>>> >>> Three reasons: >>> 1) The userspace tooling for eBPF is much better than the user space >>> tooling for cBPF. Our use case is specifically to optimize Docker >>> policies. This is roughly what their seccomp policy looks like: >>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json. >>> It would be much nicer to be able to leverage eBPF to write this in C, >>> or any of the other languages targeting eBPF. In addition, if we >>> have write-only maps, we can exfiltrate information from seccomp, like >>> arguments, and errors in a relatively cheap way compared to cBPF, and >>> then extract this via the bcc stack. Writing cBPF via C macros is a >>> pain, and the off the shelf cBPF libraries are getting no love. The >>> eBPF community is *exploding* with contributions. eBPF moving quickly is a disincentive from my perspective, as I want absolutely zero surprises when it comes to seccomp. :) Given the steady stream of exploitable flaws in eBPF, I don't want seccomp anywhere near it. :( Many distros ship with the bpf() syscall disabled, for example (or entirely compiled out, as in Chrome OS and Android). The convenience of writing C for eBPF output is certainly nice, but it seems like either LLVM could grow a cBPF backend, or libseccomp could be improved to provide the needed features. Can you explain the exfiltration piece? Do you mean it would be "cheap" in the sense that the results can be stored and studied without needing a ptrace manager to catch the failures? I remain unconvinced that seccomp needs a more descriptive language, given its limited usage. > A really naive approach is to take the JSON seccomp policy document > and converting it to plain old C with switch / case statements.
Then > we can just push that through LLVM and we're in business. Although, > for some reason, I don't think the folks will want to take a hard dep > on llvm at runtime, so maybe there's some mechanism where it first > tries llvm, then tries to create a eBPF application naively, and then > falls back to cBPF. My primary fear with the first two approaches is > that given how the policies are written today, it's not conducive to > the eBPF instruction limit. How about having libseccomp grow a JSON parser? >>> 2) In my testing, which thus so far has been very rudimentary, with >>> rewriting the policy that libseccomp generates from the Docker policy >>> to use eBPF, and eBPF maps performs much better than cBPF. The >>> specific case tested was to use a bpf array to lookup rules for a >>> particular syscall. In a super trivial test, this was about 5% low >>> latency than using traditional branches. If you need more evidence of >>> this, I can work a little bit more on the maps related patches, and >>> see if I can get some more benchmarking. From my understanding, we >>> would need to add "sealing" support for maps, in which they can be >>> marked as read-only, and only at that point should an eBPF seccomp >>> program be able to read from them. This came up recently on the libseccomp mailing list. The map lookup is faster than a linear search, but for large filters, the filter can be written as a balanced tree (as Chrome does), or reordered by syscall frequency (as is recommended by minijail), and that appears to get a much larger improvement than even the map lookup. >>> 3) Eventually, I'd like to use some more advanced capabilities of >>> eBPF, like being able to rewrite arguments safely (not things referred >>> to by pointers, but just plain old arguments). Much like 1), I don't find this an incentive, as the interactions become much harder to reason about, and I am concerned we'll open seccomp up to attack for a relatively small benefit. 
However, rewriting arguments has come up in very narrow cases, and Tycho was working on a method of doing userspace notifications (i.e. without a ptrace manager) to get us closer. If the needs Tycho outlined[1] could be addressed fully with eBPF, and we can very narrowly scope the use of the "extra" eBPF features, I might be more inclined to merge something like this, but I want to take it very carefully. Besides creating a dependency on the bpf() syscall, this would create side channels (via maps) that make me very uncomfortable when dealing with process isolation. (Though, in theory, this is already correctly constrained by no-new-privs...) Tycho, could you get what you needed from eBPF? My impression would be that you'd still need a user notification mechanism to stop the process, as the decisions about how to rewrite arguments likely cannot be fully characterized by the internal eBPF filter. -Kees [1] https://patchwork.kernel.org/patch/10199295/ -- Kees Cook Pixel Security From keescook at chromium.org Tue Feb 13 20:18:18 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 12:18:18 -0800 Subject: [PATCH net-next 3/3] bpf: Add eBPF seccomp sample programs In-Reply-To: <20180213154320.GA3319@ircssh-2.c.rugged-nimbus-611.internal> References: <20180213154320.GA3319@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 7:43 AM, Sargun Dhillon wrote: > +++ b/samples/bpf/seccomp1_kern.c > @@ -0,0 +1,17 @@ > +#include > +#include > +#include > +#include "bpf_helpers.h" > +#include > + > +/* Returns EPERM when trying to close fd 999 */ > +SEC("seccomp") > +int bpf_prog1(struct seccomp_data *ctx) > +{ > + if (ctx->nr == __NR_close && ctx->args[0] == 999) > + return SECCOMP_RET_ERRNO | EPERM; > + > + return SECCOMP_RET_ALLOW; > +} > + > +char _license[] SEC("license") = "GPL"; > [...] 
> +++ b/samples/bpf/seccomp2_kern.c > @@ -0,0 +1,24 @@ > +#include > +#include > +#include > +#include "bpf_helpers.h" > +#include > + > +static inline int unknown(struct seccomp_data *ctx) > +{ > + if (ctx->args[0] % 2 == 0) > + return SECCOMP_RET_KILL; > + return SECCOMP_RET_LOG; > +} > + > +/* Returns errno on sched_yield syscall */ > +SEC("seccomp") > +int bpf_prog1(struct seccomp_data *ctx) > +{ > + if (ctx->nr == __NR_sched_yield) > + return SECCOMP_RET_ERRNO | EPERM; > + > + return SECCOMP_RET_ALLOW; > +} > + > +char _license[] SEC("license") = "GPL"; Nit: these should check architecture before syscall number. Since they're samples, and people look at them and copy them regularly, they should be as safe/correct as possible. -Kees -- Kees Cook Pixel Security From tom.hromatka at oracle.com Tue Feb 13 20:33:44 2018 From: tom.hromatka at oracle.com (Tom Hromatka) Date: Tue, 13 Feb 2018 13:33:44 -0700 Subject: [PATCH net-next 0/3] eBPF Seccomp filters Message-ID: <7eb1497e-e5f3-c5ba-e255-7f510795b51d@oracle.com> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: > This patchset enables seccomp filters to be written in eBPF. Although, > this patchset doesn't introduce much of the functionality enabled by > eBPF, it lays the ground work for it. > > It also introduces the capability to dump eBPF filters via the PTRACE > API in order to make it so that CHECKPOINT_RESTORE will be satisifed. > In the attached samples, there's an example of this. One can then use > BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, > and use that at reload time. > > The primary reason for not adding maps support in this patchset is > to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. > If we have a map that the BPF program can read, it can potentially > "change" privileges after running. It seems like doing writes only > is safe, because it can be pure, and side effect free, and therefore > not negatively effect PR_SET_NO_NEW_PRIVS.
Nonetheless, if we come > to an agreement, this can be in a follow-up patchset.

Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp userspace mailing list just last week:
https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74

The kernel changes I proposed are in this email:
https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ

In that email thread, Kees requested that I try out a binary tree in cBPF and evaluate its performance. I just got a rough prototype working, and while not as fast as an eBPF hash map, the cBPF binary tree was a significant improvement over the linear list of ifs that are currently generated. Also, it only required changing a single function within the libseccomp library itself.

https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b

Here are the results I am currently seeing using an in-house customer's seccomp filter and a simplistic test program that runs getppid() thousands of times.

Test Case                                               minimum TSC ticks to make syscall
-----------------------------------------------------------------------------------------
seccomp disabled                                         620
getppid() at the front of 306-syscall seccomp filter     722
getppid() in middle of 306-syscall seccomp filter       1392
getppid() at the end of the 306-syscall filter          2452
seccomp using a 306-syscall-sized eBPF hash map          800
cBPF filter using a binary tree                          922

Thanks.

Tom

From keescook at chromium.org Tue Feb 13 20:34:20 2018
From: keescook at chromium.org (Kees Cook)
Date: Tue, 13 Feb 2018 12:34:20 -0800
Subject: [PATCH net-next 1/3] bpf, seccomp: Add eBPF filter capabilities
In-Reply-To: <20180213154255.GA3301@ircssh-2.c.rugged-nimbus-611.internal>
References: <20180213154255.GA3301@ircssh-2.c.rugged-nimbus-611.internal>
Message-ID:

On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: > From: Sargun Dhillon > > This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant > to be used for seccomp filters as an alternative to cBPF filters.
The > program type has relatively limited capabilities in terms of helpers, > but that can be extended later on. > > It also introduces a new mechanism to attach these filters via the > prctl and seccomp syscalls -- SECCOMP_MODE_FILTER_EXTENDED, and > SECCOMP_SET_MODE_FILTER_EXTENDED respectively. > > Signed-off-by: Sargun Dhillon > --- > arch/Kconfig | 7 ++ > include/linux/bpf_types.h | 3 + > include/uapi/linux/bpf.h | 2 + > include/uapi/linux/seccomp.h | 15 +++-- > kernel/bpf/syscall.c | 1 + > kernel/seccomp.c | 148 +++++++++++++++++++++++++++++++++++++------ > 6 files changed, 150 insertions(+), 26 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 76c0b54443b1..db675888577c 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -401,6 +401,13 @@ config SECCOMP_FILTER > > See Documentation/prctl/seccomp_filter.txt for details. > > +config SECCOMP_FILTER_EXTENDED > + bool "Extended BPF seccomp filters" > + depends on SECCOMP_FILTER && BPF_SYSCALL > + help > + Enables seccomp filters to be written in eBPF, as opposed > + to just cBPF filters. 
> + > config HAVE_GCC_PLUGINS > bool > help > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > index 19b8349a3809..945c65c4e461 100644 > --- a/include/linux/bpf_types.h > +++ b/include/linux/bpf_types.h > @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) > #ifdef CONFIG_CGROUP_BPF > BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) > #endif > +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED > +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) > +#endif > > BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index db6bdc375126..5f96cb7ed954 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1,3 +1,4 @@ > + > /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com > * > @@ -133,6 +134,7 @@ enum bpf_prog_type { > BPF_PROG_TYPE_SOCK_OPS, > BPF_PROG_TYPE_SK_SKB, > BPF_PROG_TYPE_CGROUP_DEVICE, > + BPF_PROG_TYPE_SECCOMP, > }; > > enum bpf_attach_type { > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 2a0bd9dd104d..7da8b39f2a6a 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -7,14 +7,17 @@ > > > /* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, ) */ > -#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ > -#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ > -#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ > +#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ > +#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ > +#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ > +#define SECCOMP_MODE_FILTER_EXTENDED 3 /* uses eBPF filter from fd */ This MODE flag isn't needed: it's still using a filter, and the interface changes should be sufficient with SECCOMP_SET_MODE_FILTER_EXTENDED below. 
> /* Valid operations for seccomp syscall. */ > -#define SECCOMP_SET_MODE_STRICT 0 > -#define SECCOMP_SET_MODE_FILTER 1 > -#define SECCOMP_GET_ACTION_AVAIL 2 > +#define SECCOMP_SET_MODE_STRICT 0 > +#define SECCOMP_SET_MODE_FILTER 1 > +#define SECCOMP_GET_ACTION_AVAIL 2 > +#define SECCOMP_SET_MODE_FILTER_EXTENDED 3 It seems like this should be a FILTER flag, not a syscall op change? > + > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > #define SECCOMP_FILTER_FLAG_TSYNC 1 > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index e24aa3241387..86d6ec8b916d 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) > > if (type != BPF_PROG_TYPE_SOCKET_FILTER && > type != BPF_PROG_TYPE_CGROUP_SKB && > + type != BPF_PROG_TYPE_SECCOMP && > !capable(CAP_SYS_ADMIN)) > return -EPERM; So only init_ns-CAP_SYS_ADMIN would be able to use seccomp eBPF? > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 940fa408a288..b30dd25c1cb8 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -37,6 +37,7 @@ > #include > #include > #include > +#include > > /** > * struct seccomp_filter - container for seccomp BPF programs > @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > > BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); > > - /* > - * Installing a seccomp filter requires that the task has > - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. > - * This avoids scenarios where unprivileged tasks can affect the > - * behavior of privileged children. 
> - */ > - if (!task_no_new_privs(current) && > - security_capable_noaudit(current_cred(), current_user_ns(), > - CAP_SYS_ADMIN) != 0) > - return ERR_PTR(-EACCES); > - > /* Allocate a new seccomp_filter */ > sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); > if (!sfilter) > @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED > +/** > + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd > + * @user_filter: pointer to the user data containing an fd. > + * > + * Returns 0 on success and non-zero otherwise. > + */ > +static struct seccomp_filter * > +seccomp_prepare_extended_filter(const char __user *user_fd) > +{ > + struct seccomp_filter *sfilter; > + struct bpf_prog *fp; > + int fd; > + > + /* Fetch the fd from userspace */ > + if (get_user(fd, (int __user *)user_fd)) > + return ERR_PTR(-EFAULT); > + > + /* Allocate a new seccomp_filter */ > + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); > + if (!sfilter) > + return ERR_PTR(-ENOMEM); > + > + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); > + if (IS_ERR(fp)) { > + kfree(sfilter); > + return ERR_CAST(fp); > + } > + > + sfilter->prog = fp; > + refcount_set(&sfilter->usage, 1); > + > + return sfilter; > +} > +#else > +static struct seccomp_filter * > +seccomp_prepare_extended_filter(const char __user *filter_fd) > +{ > + return ERR_PTR(-EINVAL); > +} > +#endif > + > /** > * seccomp_attach_filter: validate and attach filter > * @flags: flags to change filter behavior > @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) > static inline void seccomp_filter_free(struct seccomp_filter *filter) > { > if (filter) { > - bpf_prog_destroy(filter->prog); > + if (bpf_prog_was_classic(filter->prog)) > + bpf_prog_destroy(filter->prog); > + else > + bpf_prog_put(filter->prog); > kfree(filter); > } > } > @@ -842,18 +877,36 @@ static long seccomp_set_mode_strict(void) > * 
Returns 0 on success or -EINVAL on failure. > */ > static long seccomp_set_mode_filter(unsigned int flags, > - const char __user *filter) > + const char __user *filter, > + unsigned long filter_type) I think this can just live in flags? > { > - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; > + /* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ > + const unsigned long filter_mode = SECCOMP_MODE_FILTER; > struct seccomp_filter *prepared = NULL; > long ret = -EINVAL; > > /* Validate flags. */ > if (flags & ~SECCOMP_FILTER_FLAG_MASK) > return -EINVAL; > + /* > + * Installing a seccomp filter requires that the task has > + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. > + * This avoids scenarios where unprivileged tasks can affect the > + * behavior of privileged children. > + */ > + if (!task_no_new_privs(current) && > + security_capable_noaudit(current_cred(), current_user_ns(), > + CAP_SYS_ADMIN) != 0) > + return -EACCES; This changes the order of checks -- before, too-large filters would get EINVAL even if they lacked the needed permissions. As long as this doesn't break anything in the real world, it should be fine, but I might want to instead create a perm-check function and just call it in both functions. (And likely write a self-test that checks this order, if it doesn't already exist.) > > /* Prepare the new filter before holding any locks. 
*/ > - prepared = seccomp_prepare_user_filter(filter); > + if (filter_type == SECCOMP_SET_MODE_FILTER_EXTENDED) > + prepared = seccomp_prepare_extended_filter(filter); > + else if (filter_type == SECCOMP_SET_MODE_FILTER) > + prepared = seccomp_prepare_user_filter(filter); > + else > + return -EINVAL; > + > if (IS_ERR(prepared)) > return PTR_ERR(prepared); > > @@ -867,7 +920,7 @@ static long seccomp_set_mode_filter(unsigned int flags, > > spin_lock_irq(&current->sighand->siglock); > > - if (!seccomp_may_assign_mode(seccomp_mode)) > + if (!seccomp_may_assign_mode(filter_mode)) > goto out; > > ret = seccomp_attach_filter(flags, prepared); > @@ -876,7 +929,7 @@ static long seccomp_set_mode_filter(unsigned int flags, > /* Do not free the successfully attached filter. */ > prepared = NULL; > > - seccomp_assign_mode(current, seccomp_mode); > + seccomp_assign_mode(current, filter_mode); With a filter flag, the above hunks don't need to be changed, for example. > out: > spin_unlock_irq(&current->sighand->siglock); > if (flags & SECCOMP_FILTER_FLAG_TSYNC) > @@ -926,7 +979,9 @@ static long do_seccomp(unsigned int op, unsigned int flags, > return -EINVAL; > return seccomp_set_mode_strict(); > case SECCOMP_SET_MODE_FILTER: > - return seccomp_set_mode_filter(flags, uargs); > + return seccomp_set_mode_filter(flags, uargs, op); > + case SECCOMP_SET_MODE_FILTER_EXTENDED: > + return seccomp_set_mode_filter(flags, uargs, op); And this isn't needed, since it would be passed as a flag. > case SECCOMP_GET_ACTION_AVAIL: > if (flags != 0) > return -EINVAL; > @@ -969,6 +1024,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) > op = SECCOMP_SET_MODE_FILTER; > uargs = filter; > break; > + case SECCOMP_MODE_FILTER_EXTENDED: > + op = SECCOMP_SET_MODE_FILTER_EXTENDED; > + uargs = filter; > + break; Same.
> default: > return -EINVAL; > } > @@ -1040,8 +1099,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, > if (IS_ERR(filter)) > return PTR_ERR(filter); > > - fprog = filter->prog->orig_prog; > - if (!fprog) { > + if (!bpf_prog_was_classic(filter->prog)) { > /* This must be a new non-cBPF filter, since we save > * every cBPF filter's orig_prog above when > * CONFIG_CHECKPOINT_RESTORE is enabled. > @@ -1050,6 +1108,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, > goto out; > } > > + fprog = filter->prog->orig_prog; I wonder if it would be easier to review to split eBPF install from the eBPF "get filter" changes as separate patches? > ret = fprog->len; > if (!data) > goto out; > @@ -1239,6 +1298,55 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, > return 0; > } > > +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED > +static bool seccomp_is_valid_access(int off, int size, > + enum bpf_access_type type, > + struct bpf_insn_access_aux *info) > +{ > + if (type != BPF_READ) > + return false; > + > + if (off < 0 || off + size > sizeof(struct seccomp_data)) > + return false; > + > + switch (off) { > + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): > + return (size == sizeof(__u64)); > + case bpf_ctx_range(struct seccomp_data, nr): > + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); > + case bpf_ctx_range(struct seccomp_data, arch): > + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); > + case bpf_ctx_range(struct seccomp_data, instruction_pointer): > + return (size == FIELD_SIZEOF(struct seccomp_data, > + instruction_pointer)); > + } > + > + return false; > +} > + > +static const struct bpf_func_proto * > +seccomp_func_proto(enum bpf_func_id func_id) > +{ > + switch (func_id) { > + case BPF_FUNC_get_current_uid_gid: > + return &bpf_get_current_uid_gid_proto; > + case BPF_FUNC_trace_printk: > + if (capable(CAP_SYS_ADMIN)) > + return 
bpf_get_trace_printk_proto(); > + default: > + return NULL; > + } > +} This makes me so uncomfortable. :) Why is uid/gid needed? Why add printk support here? (And why is it CAP_SYS_ADMIN checked if the entire filter is CAP_SYS_ADMIN checked before being attached?) > + > +const struct bpf_prog_ops seccomp_prog_ops = { > +}; > + > +const struct bpf_verifier_ops seccomp_verifier_ops = { > + .get_func_proto = seccomp_func_proto, > + .is_valid_access = seccomp_is_valid_access, > +}; > +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ > + > static struct ctl_path seccomp_sysctl_path[] = { > { .procname = "kernel", }, > { .procname = "seccomp", }, > -- > 2.14.1 > -Kees -- Kees Cook Pixel Security From keescook at chromium.org Tue Feb 13 20:35:46 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 12:35:46 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <7eb1497e-e5f3-c5ba-e255-7f510795b51d@oracle.com> References: <7eb1497e-e5f3-c5ba-e255-7f510795b51d@oracle.com> Message-ID: On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka wrote: > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >> >> This patchset enables seccomp filters to be written in eBPF. Although, >> this patchset doesn't introduce much of the functionality enabled by >> eBPF, it lays the ground work for it. >> >> It also introduces the capability to dump eBPF filters via the PTRACE >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >> In the attached samples, there's an example of this. One can then use >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >> and use that at reload time. >> >> The primary reason for not adding maps support in this patchset is >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >> If we have a map that the BPF program can read, it can potentially >> "change" privileges after running. 
It seems like doing writes only >> is safe, because it can be pure, and side effect free, and therefore >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >> to an agreement, this can be in a follow-up patchset. > > > > Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp > userspace mailing list just last week: > https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74 > > The kernel changes I proposed are in this email: > https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ > > In that email thread, Kees requested that I try out a binary tree in cBPF > and evaluate its performance. I just got a rough prototype working, and > while not as fast as an eBPF hash map, the cBPF binary tree was a > significant > improvement over the linear list of ifs that are currently generated. Also, > it only required changing a single function within the libseccomp libary > itself. > > https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b > > Here are the results I am currently seeing using an in-house customer's > seccomp filter and a simplistic test program that runs getppid() thousands > of times. > > Test Case minimum TSC ticks to make syscall > ---------------------------------------------------------------- > seccomp disabled 620 > getppid() at the front of 306-syscall seccomp filter 722 > getppid() in middle of 306-syscall seccomp filter 1392 > getppid() at the end of the 306-syscall filter 2452 > seccomp using a 306-syscall-sized EBPF hash map 800 > cBPF filter using a binary tree 922 I still think that's a crazy filter. :) It should be inverted to just check the 26 syscalls and a final "greater than" test. I would expect it to be faster still. 
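(Editorial sketch, not part of the thread.) The timings quoted above depend heavily on where the matching syscall sits in the filter. The following plain userspace C is only a rough model of that effect, assuming a sorted array of syscall numbers stands in for the filter; `linear_lookup`, `tree_lookup`, and the step counters are hypothetical names chosen for illustration, and a real filter would be a cBPF program rather than a C loop.

```c
#include <stddef.h>

/*
 * Model of a linear chain of "if (nr == X)" tests: the cost grows with
 * the position of the match. *steps receives the number of tests made.
 */
static int linear_lookup(const int *rules, size_t n, int nr, size_t *steps)
{
	*steps = 0;
	for (size_t i = 0; i < n; i++) {
		(*steps)++;
		if (rules[i] == nr)
			return 1;
	}
	return 0;
}

/*
 * Model of a balanced-tree filter: binary search over syscall numbers
 * sorted in ascending order. *steps receives the number of nodes visited.
 */
static int tree_lookup(const int *sorted, size_t n, int nr, size_t *steps)
{
	size_t lo = 0, hi = n;

	*steps = 0;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		(*steps)++;
		if (sorted[mid] == nr)
			return 1;
		if (sorted[mid] < nr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

With 306 entries, the linear model needs up to 306 tests for the last rule, while the tree model visits at most 9 nodes regardless of position. This says nothing about absolute tick counts, which depend on the kernel and the generated cBPF.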
:) -Kees -- Kees Cook Pixel Security From tom.hromatka at oracle.com Tue Feb 13 20:38:53 2018 From: tom.hromatka at oracle.com (Tom Hromatka) Date: Tue, 13 Feb 2018 13:38:53 -0700 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <7eb1497e-e5f3-c5ba-e255-7f510795b51d@oracle.com> Message-ID: On 02/13/2018 01:35 PM, Kees Cook wrote: > On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka wrote: >> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >>> This patchset enables seccomp filters to be written in eBPF. Although, >>> this patchset doesn't introduce much of the functionality enabled by >>> eBPF, it lays the ground work for it. >>> >>> It also introduces the capability to dump eBPF filters via the PTRACE >>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >>> In the attached samples, there's an example of this. One can then use >>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >>> and use that at reload time. >>> >>> The primary reason for not adding maps support in this patchset is >>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >>> If we have a map that the BPF program can read, it can potentially >>> "change" privileges after running. It seems like doing writes only >>> is safe, because it can be pure, and side effect free, and therefore >>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >>> to an agreement, this can be in a follow-up patchset. >> >> >> Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp >> userspace mailing list just last week: >> https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74 >> >> The kernel changes I proposed are in this email: >> https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ >> >> In that email thread, Kees requested that I try out a binary tree in cBPF >> and evaluate its performance. 
I just got a rough prototype working, and >> while not as fast as an eBPF hash map, the cBPF binary tree was a >> significant >> improvement over the linear list of ifs that are currently generated. Also, >> it only required changing a single function within the libseccomp library >> itself. >> >> https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b >> >> Here are the results I am currently seeing using an in-house customer's >> seccomp filter and a simplistic test program that runs getppid() thousands >> of times. >> >> Test Case minimum TSC ticks to make syscall >> ---------------------------------------------------------------- >> seccomp disabled 620 >> getppid() at the front of 306-syscall seccomp filter 722 >> getppid() in middle of 306-syscall seccomp filter 1392 >> getppid() at the end of the 306-syscall filter 2452 >> seccomp using a 306-syscall-sized EBPF hash map 800 >> cBPF filter using a binary tree 922 > I still think that's a crazy filter. :) It should be inverted to just > check the 26 syscalls and a final "greater than" test. I would expect > it to be faster still. :) > > -Kees I completely agree it's a crazy filter, but it seems to be a common "mistake" our users are making. It would be nice to help them out if we can. Tom From tycho at tycho.ws Tue Feb 13 20:50:40 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Tue, 13 Feb 2018 13:50:40 -0700 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: <20180213205040.h3nsvz5go6lewvuy@smitten> On Tue, Feb 13, 2018 at 12:16:42PM -0800, Kees Cook wrote: > If the needs Tycho outlined[1] could be addressed fully with eBPF, and > we can very narrowly scope the use of the "extra" eBPF features, I > might be more inclined to merge something like this, but I want to > take it very carefully.
Besides creating a dependency on the bpf()
> syscall, this would create side channels (via maps) that make me very
> uncomfortable when dealing with process isolation. (Though, in theory,
> this is already correctly constrained by no-new-privs...)
>
> Tycho, could you get what you needed from eBPF?

We could get almost all the way there, I think. We could pass the
event via a bpf map, and then have a userspace daemon do:

    while (1) {
        bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
        if (!syscall_queued(&attr))
            continue;

        do_stuff(&attr);
        set_done(&attr);
        bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    }

but as you say,

> My impression would be that you'd still need a user notification
> mechanism to stop the process, as the decisions about how to rewrite
> arguments likely cannot be fully characterized by the internal eBPF
> filter.

...there's no way to stop the seccomp'd task until userspace is
finished with whatever thing it needs to do on behalf of the seccomp'd
task (at least, IIUC).

That's of course ignoring the ergonomics from userspace: bpf_map_fops
doesn't implement poll() or anything, so we really do have to use a
while(1), if we want to allow more than one syscall queuing at a time,
we need to poll multiple map elements.

One of the extensions I had been considering floating for v2 of my set
was allowing users to pass fds back across (again, to make userspace
ergonomics a little better), which would be impossible via ebpf.
Cheers,

Tycho

From pmoore at redhat.com Tue Feb 13 21:08:13 2018
From: pmoore at redhat.com (Paul Moore)
Date: Tue, 13 Feb 2018 16:08:13 -0500
Subject: [PATCH net-next 0/3] eBPF Seccomp filters
In-Reply-To:
References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal>
Message-ID:

On Tue, Feb 13, 2018 at 3:16 PM, Kees Cook wrote:
> On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon wrote:
>> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle wrote:
>>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon wrote:
>>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook wrote:
>>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>>> and it only makes the code more complex. I'd rather stick with -- cBPF
>>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>>> seccomp filter language.
>>>>>
>>>> Three reasons:
>>>> 1) The userspace tooling for eBPF is much better than the user space
>>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>>> policies. This is roughly what their seccomp policy looks like:
>>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>>> or any of the other languages targeting eBPF. In addition, if we
>>>> have write-only maps, we can exfiltrate information from seccomp, like
>>>> arguments, and errors in a relatively cheap way compared to cBPF, and
>>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>>> pain, and the off the shelf cBPF libraries are getting no love.

What do you mean "no love"? I would consider libseccomp a cBPF
library and it is actively maintained/developed.

>>>> The eBPF community is *exploding* with contributions.
>
> eBPF moving quickly is a disincentive from my perspective, as I want
> absolutely zero surprises when it comes to seccomp. :) Given the
> steady stream of exploitable flaws in eBPF, I don't want seccomp
> anywhere near it.
:( Many distros ship with the bpf() syscall > disabled, for example (or entirely compiled out, as in Chrome OS and > Android). > > The convenience of writing C for eBPF output is certainly nice, but it > seems like either LLVM could grow a cBPF backend, or libseccomp could > be improved to provide the needed features. I'm always happy to discuss adding new functionality to libseccomp; feel free to use the GH issue tracker or the libseccomp mailing list. > Can you explain the exfiltration piece? Do you mean it would be > "cheap" in the sense that the results can be stored and studied > without needing a ptrace manager to catch the failures? I'm a little confused about this piece too. > I remain unconvinced that seccomp needs a more descriptive language, > given its limited usage. FWIW, I haven't yet seen a functionality request for libseccomp that couldn't be addressed with cBPF and some creativity. >> A really naive approach is to take the JSON seccomp policy document >> and converting it to plain old C with switch / case statements. Then >> we can just push that through LLVM and we're in business. Although, >> for some reason, I don't think the folks will want to take a hard dep >> on llvm at runtime, so maybe there's some mechanism where it first >> tries llvm, then tries to create a eBPF application naively, and then >> falls back to cBPF. My primary fear with the first two approaches is >> that given how the policies are written today, it's not conducive to >> the eBPF instruction limit. > > How about having libseccomp grow a JSON parser? Generally my opinion is that seccomp filter configuration file formats are best left to the calling application, not libseccomp. This way the seccomp filter configuration can be consistent with the rest of the application's configuration. However, if someone really wants to work on this, I'm not sure I would say "no". 
>>>> 2) In my testing, which thus far has been very rudimentary, with
>>>> rewriting the policy that libseccomp generates from the Docker policy
>>>> to use eBPF, and eBPF maps performs much better than cBPF. The
>>>> specific case tested was to use a bpf array to lookup rules for a
>>>> particular syscall. In a super trivial test, this was about 5% lower
>>>> latency than using traditional branches. If you need more evidence of
>>>> this, I can work a little bit more on the maps related patches, and
>>>> see if I can get some more benchmarking. From my understanding, we
>>>> would need to add "sealing" support for maps, in which they can be
>>>> marked as read-only, and only at that point should an eBPF seccomp
>>>> program be able to read from them.
>
> This came up recently on the libseccomp mailing list. The map lookup
> is faster than a linear search, but for large filters, the filter can
> be written as a balanced tree (as Chrome does), or reordered by
> syscall frequency (as is recommended by minijail), and that appears to
> get a much larger improvement than even the map lookup.

For reference, the current libseccomp approach is to put the shorter
rules near the top of the filter (e.g. syscall only) with the longer
rules (e.g. syscall + arguments) towards the end. The libseccomp API
does allow for callers to influence the ordering via syscall priority
hints. Someone is currently looking at a tree-based ordering of
syscalls for libseccomp, and I'm always open to new/better ideas.
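
To make the linear-versus-tree comparison above a bit more concrete, here
is a minimal sketch in plain C. It is not libseccomp code and does not
generate BPF; it only models how many syscall-number tests are evaluated
before a match, and the function names linear_compares() and
tree_compares() are made up for illustration. The tree is modelled as a
binary search over a sorted array, which is an assumption about how a
balanced filter would behave rather than a description of any real
generator.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of a linear chain of "if (nr == X)" tests: count how many
 * entries are examined before the target is found (or the list ends).
 */
size_t linear_compares(const int *nrs, size_t n, int target)
{
	size_t compares = 0;

	assert(nrs != NULL);
	for (size_t i = 0; i < n; i++) {
		compares++;
		if (nrs[i] == target)
			break;
	}
	return compares;
}

/*
 * Model of a balanced tree laid over the same (sorted) syscall numbers:
 * count how many nodes are visited by a binary search for the target.
 */
size_t tree_compares(const int *sorted, size_t n, int target)
{
	size_t lo = 0, hi = n, visited = 0;

	assert(sorted != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		visited++;
		if (sorted[mid] == target)
			break;
		if (sorted[mid] < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return visited;
}
```

With 306 sorted entries, the last entry costs 306 tests in the linear
model and no lookup visits more than 9 nodes in the tree model. This says
nothing about real TSC ticks; it only illustrates why reordering or a
tree layout can matter more than the choice between cBPF and eBPF.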
-- paul moore security @ redhat From keescook at chromium.org Tue Feb 13 21:09:20 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 13:09:20 -0800 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: <20180204104946.25559-2-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> Message-ID: On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > This patch introduces a means for syscalls matched in seccomp to notify > some other task that a particular filter has been triggered. > > The motivation for this is primarily for use with containers. For example, > if a container does an init_module(), we obviously don't want to load this > untrusted code, which may be compiled for the wrong version of the kernel > anyway. Instead, we could parse the module image, figure out which module > the container is trying to load and load it on the host. > > As another example, containers cannot mknod(), since this checks > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > coding some whitelist in the kernel. Another example is mount(), which has > many security restrictions for good reason, but configuration or runtime > knowledge could potentially be used to relax these restrictions. Related to the eBPF seccomp thread, can the logic for these things be handled entirely by eBPF? My assumption is that you still need to stop the process to do something (i.e. do a mknod, or a mount) before letting it continue. Is there some "wait for notification" system in eBPF? > This patch adds functionality that is already possible via at least two > other means that I know about, both of which involve ptrace(): first, one > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. 
> Unfortunately this is slow, so a faster version would be to install a > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. > Since ptrace allows only one tracer, if the container runtime is that > tracer, users inside the container (or outside) trying to debug it will not > be able to use ptrace, which is annoying. It also means that older > distributions based on Upstart cannot boot inside containers using ptrace, > since upstart itself uses ptrace to start services. Agreed: notification is extremely painful right now. The container case is compelling, since it will always want a way to trick out these kinds of filesystem calls. > The actual implementation of this is fairly small, although getting the > synchronization right was/is slightly complex. Also worth noting that there > is one race still present: > > 1. a task does a SECCOMP_RET_USER_NOTIF > 2. the userspace handler reads this notification > 3. the task dies > 4. a new task with the same pid starts > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > that the previous one did > 6. the userspace handler writes a response > > There's no way to distinguish this case right now. Maybe we care, maybe we > don't, but it's worth noting. So, I'd like to avoid the cookie if possible (surprise). Why isn't it possible to close the kernel-side of the fd to indicate that it lost the pid it was attached to? Is this just that the reader has no idea who is sending messages? So the risk is a fork/die loop within the same process tree (i.e. attached to the same filter)? Hrmpf. I can't think of a better way to handle the one(fd)-to-many(task-with-that-filter-attached) situation... > Right now the interface is a simple structure copy across a file > descriptor. We could potentially invent something fancier. I wonder if this communication should be netlink, which gives a more well-structured way to describe what's on the wire? 
The reason I ask is because if we ever change the seccomp_data structure, we'll now have two places where we need to deal with it (the first being within the BPF itself). My initial idea was to prefix the communication with a size field, then send the structure, and then I had nightmares, and realized this was basically netlink reinvented. > Finally, it's worth noting that the classic seccomp TOCTOU of reading > memory data from the task still applies here, but can be avoided with > careful design of the userspace handler: if the userspace handler reads all > of the task memory that is necessary before applying its security policy, > the tracee's subsequent memory edits will not be read by the tracer. Is this really true? Couldn't a multi-threaded process muck with memory out from under both the manager and the stopped process? > Signed-off-by: Tycho Andersen > CC: Kees Cook > CC: Andy Lutomirski > CC: Oleg Nesterov > CC: Eric W. Biederman > CC: "Serge E. Hallyn" > CC: Christian Brauner > CC: Tyler Hicks > CC: Akihiro Suda > --- > arch/Kconfig | 7 + > include/linux/seccomp.h | 3 +- > include/uapi/linux/seccomp.h | 18 +- > kernel/seccomp.c | 366 +++++++++++++++++++++++++- > tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++- > 5 files changed, 502 insertions(+), 6 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 400b9e1b2f27..2946cb6fd704 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -387,6 +387,13 @@ config SECCOMP_FILTER > > See Documentation/prctl/seccomp_filter.txt for details. > > +config SECCOMP_USER_NOTIFICATION > + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action" > + depends on SECCOMP_FILTER > + help > + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp > + programs to notify a userspace listener that a particular event happened. 
> + > config HAVE_GCC_PLUGINS > bool > help > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 10f25f7e4304..ce07da2ffd53 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -5,7 +5,8 @@ > #include > > #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ > - SECCOMP_FILTER_FLAG_LOG) > + SECCOMP_FILTER_FLAG_LOG | \ > + SECCOMP_FILTER_FLAG_GET_LISTENER) > > #ifdef CONFIG_SECCOMP > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 2a0bd9dd104d..4a342aa2e524 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -17,8 +17,9 @@ > #define SECCOMP_GET_ACTION_AVAIL 2 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > -#define SECCOMP_FILTER_FLAG_TSYNC 1 > -#define SECCOMP_FILTER_FLAG_LOG 2 > +#define SECCOMP_FILTER_FLAG_TSYNC 1 > +#define SECCOMP_FILTER_FLAG_LOG 2 > +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 > > /* > * All BPF programs must return a 32-bit value. > @@ -34,6 +35,7 @@ > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ /me tries to come up with an ordering rationale here and fails. An ERRNO filter would block a USER_NOTIF because it's unconditional. TRACE could be either, USER_NOTIF could be either. This means TRACE rules would be bumped by a USER_NOTIF... hmm. 
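
The ordering question can be sketched in plain C. The snippet below
assumes that, when several filters are attached, the result whose action
bits are smallest (compared as a signed 32-bit value) wins; that is an
assumption about how the kernel combines filter results, not something
taken from this patch. The SK_ prefix and the function names are
hypothetical, chosen only to avoid clashing with the real uapi header,
and SK_RET_USER_NOTIF uses the value proposed in this RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Action values; SK_RET_USER_NOTIF is the value proposed in this RFC. */
#define SK_RET_KILL_PROCESS 0x80000000U
#define SK_RET_KILL_THREAD  0x00000000U
#define SK_RET_TRAP         0x00030000U
#define SK_RET_ERRNO        0x00050000U
#define SK_RET_USER_NOTIF   0x7fc00000U
#define SK_RET_TRACE        0x7ff00000U
#define SK_RET_LOG          0x7ffc0000U
#define SK_RET_ALLOW        0x7fff0000U
#define SK_RET_ACTION_FULL  0xffff0000U

/*
 * Action bits viewed as a signed value, so that KILL_PROCESS (top bit
 * set) sorts below everything else. The unsigned-to-signed conversion
 * is implementation-defined in C; gcc and clang wrap as expected.
 */
int32_t sk_action_only(uint32_t ret)
{
	return (int32_t)(ret & SK_RET_ACTION_FULL);
}

/* Pick the winning result among n filter return values. */
uint32_t sk_pick_result(const uint32_t *rets, size_t n)
{
	uint32_t winner = SK_RET_ALLOW;

	assert(rets != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (sk_action_only(rets[i]) < sk_action_only(winner))
			winner = rets[i];
	}
	return winner;
}
```

Under these assumptions, an ERRNO result from any filter beats a
USER_NOTIF result, and a USER_NOTIF result beats TRACE, which is the
"bumped" behaviour described above.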
> #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */ > #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */ > @@ -59,4 +61,16 @@ struct seccomp_data { > __u64 args[6]; > }; > > +struct seccomp_notif { > + __u32 id; > + pid_t pid; > + struct seccomp_data data; > +}; > + > +struct seccomp_notif_resp { > + __u32 id; > + int error; > + long val; > +}; > + > #endif /* _UAPI_LINUX_SECCOMP_H */ > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 5f0dfb2abb8d..9541eb379e74 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -38,6 +38,52 @@ > #include > #include > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION I wonder if it's time to split up seccomp.c ... probably not, but I've always been unhappy with the #ifdefs even for just regular _FILTER. ;) > +#include > +#include > + > +enum notify_state { > + SECCOMP_NOTIFY_INIT, > + SECCOMP_NOTIFY_READ, > + SECCOMP_NOTIFY_WRITE, > +}; > + > +struct seccomp_knotif { > + /* The pid whose filter triggered the notification */ > + pid_t pid; > + > + /* > + * The "cookie" for this request; this is unique for this filter. > + */ > + u32 id; > + > + /* > + * The seccomp data. This pointer is valid the entire time this > + * notification is active, since it comes from __seccomp_filter which > + * eclipses the entire lifecycle here. 
> + */ > + const struct seccomp_data *data; > + > + /* > + * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not > + * yet been sent to userspace > + * SECCOMP_NOTIFY_READ: sent to userspace but no response yet > + * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has > + * not yet been written back to the application > + */ > + enum notify_state state; > + > + /* The return values, only valid when in SECCOMP_NOTIFY_WRITE */ > + int error; > + long val; > + > + /* Signals when this has entered SECCOMP_NOTIFY_WRITE */ > + struct completion ready; > + > + struct list_head list; > +}; > +#endif > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -64,6 +110,30 @@ struct seccomp_filter { > bool log; > struct seccomp_filter *prev; > struct bpf_prog *prog; > + > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > + /* > + * A semaphore that users of this notification can wait on for > + * changes. Actual reads and writes are still controlled with > + * filter->notify_lock. > + */ > + struct semaphore request; > + > + /* > + * A lock for all notification-related accesses. > + */ > + struct mutex notify_lock; > + > + /* > + * Is there currently an attached listener? > + */ > + bool has_listener; > + > + /* > + * A list of struct seccomp_knotif elements. > + */ Nit: these 3 above can be one-line comments. > + struct list_head notifications; > +#endif > }; > > /* Limit any path through the tree to 256KB worth of instructions. 
*/ > @@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > if (!sfilter) > return ERR_PTR(-ENOMEM); > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > + mutex_init(&sfilter->notify_lock); > + sema_init(&sfilter->request, 0); > + INIT_LIST_HEAD(&sfilter->notifications); > +#endif > + > ret = bpf_prog_create_from_user(&sfilter->prog, fprog, > seccomp_check_filter, save_orig); > if (ret < 0) { > @@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason) > #define SECCOMP_LOG_TRACE (1 << 4) > #define SECCOMP_LOG_LOG (1 << 5) > #define SECCOMP_LOG_ALLOW (1 << 6) > +#define SECCOMP_LOG_USER_NOTIF (1 << 7) > > static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS | > SECCOMP_LOG_KILL_THREAD | > SECCOMP_LOG_TRAP | > SECCOMP_LOG_ERRNO | > SECCOMP_LOG_TRACE | > - SECCOMP_LOG_LOG; > + SECCOMP_LOG_LOG | > + SECCOMP_LOG_USER_NOTIF; > > static inline void seccomp_log(unsigned long syscall, long signr, u32 action, > bool requested) > @@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action, > case SECCOMP_RET_TRACE: > log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE; > break; > + case SECCOMP_RET_USER_NOTIF: > + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF; > + break; > case SECCOMP_RET_LOG: > log = seccomp_actions_logged & SECCOMP_LOG_LOG; > break; > @@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall) > } > #else > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +/* > + * Finds the next unique notification id. 
> + */ > +static u32 seccomp_next_notify_id(struct list_head *list) > +{ > + struct seccomp_knotif *knotif = NULL; > + struct list_head *cur; > + u32 id = get_random_u32(); > + > +again: > + list_for_each(cur, list) { > + knotif = list_entry(cur, struct seccomp_knotif, list); > + > + if (knotif->id == id) { > + id = get_random_u32(); > + goto again; > + } > + } > + > + return id; > +} > + > +static void seccomp_do_user_notification(int this_syscall, > + struct seccomp_filter *match, > + const struct seccomp_data *sd) > +{ > + int err; > + long ret = 0; > + struct seccomp_knotif n = {}; > + > + mutex_lock(&match->notify_lock); > + if (!match->has_listener) { > + err = -ENOSYS; > + goto out; > + } > + > + n.pid = current->pid; > + n.state = SECCOMP_NOTIFY_INIT; > + n.data = sd; > + n.id = seccomp_next_notify_id(&match->notifications); > + init_completion(&n.ready); > + > + list_add(&n.list, &match->notifications); > + > + mutex_unlock(&match->notify_lock); > + up(&match->request); > + > + err = wait_for_completion_interruptible(&n.ready); > + /* > + * This syscall is getting interrupted. We no longer need to > + * tell userspace about it, and any userspace responses should > + * be ignored. 
> + */ > + mutex_lock(&match->notify_lock); > + if (err < 0) > + goto remove_list; > + > + ret = n.val; > + err = n.error; > + > + WARN(n.state != SECCOMP_NOTIFY_WRITE, > + "notified about write complete when state is not write"); > + > +remove_list: > + list_del(&n.list); > +out: > + mutex_unlock(&match->notify_lock); > + syscall_set_return_value(current, task_pt_regs(current), > + err, ret); > +} > +#else > +static void seccomp_do_user_notification(int this_syscall, > + u32 action, > + struct seccomp_filter *match, > + const struct seccomp_data *sd) > +{ > + WARN(1, "user notification received, but disabled"); > + seccomp_log(this_syscall, SIGSYS, action, true); > + do_exit(SIGSYS); > +} > +#endif > + > #ifdef CONFIG_SECCOMP_FILTER > static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > const bool recheck_after_trace) > @@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > > return 0; > > + case SECCOMP_RET_USER_NOTIF: > + seccomp_do_user_notification(this_syscall, match, sd); > + goto skip; > case SECCOMP_RET_LOG: > seccomp_log(this_syscall, 0, action, true); > return 0; > @@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void) > } > > #ifdef CONFIG_SECCOMP_FILTER > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +static struct file *init_listener(struct seccomp_filter *filter); > +#endif > + > /** > * seccomp_set_mode_filter: internal function for setting seccomp filter > * @flags: flags to change filter behavior > @@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags, > const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; > struct seccomp_filter *prepared = NULL; > long ret = -EINVAL; > + int listener = 0; > + struct file *listener_f = NULL; > > /* Validate flags. 
*/ > if (flags & ~SECCOMP_FILTER_FLAG_MASK) > @@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags, > if (IS_ERR(prepared)) > return PTR_ERR(prepared); > > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > + listener = get_unused_fd_flags(O_RDWR); > + if (listener < 0) { > + ret = listener; > + goto out_free; > + } > + > + listener_f = init_listener(prepared); > + if (IS_ERR(listener_f)) { > + put_unused_fd(listener); > + ret = PTR_ERR(listener_f); > + goto out_free; > + } > + } > + > /* > * Make sure we cannot change seccomp or nnp state via TSYNC > * while another thread is in the middle of calling exec. > */ > if (flags & SECCOMP_FILTER_FLAG_TSYNC && > mutex_lock_killable(&current->signal->cred_guard_mutex)) > - goto out_free; > + goto out_put_fd; > > spin_lock_irq(&current->sighand->siglock); > > @@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags, > spin_unlock_irq(&current->sighand->siglock); > if (flags & SECCOMP_FILTER_FLAG_TSYNC) > mutex_unlock(&current->signal->cred_guard_mutex); > +out_put_fd: > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > + if (ret < 0) { > + fput(listener_f); > + put_unused_fd(listener); > + } else { > + fd_install(listener, listener_f); > + ret = listener; > + } > + } > out_free: > seccomp_filter_free(prepared); > return ret; > @@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction) > case SECCOMP_RET_LOG: > case SECCOMP_RET_ALLOW: > break; > + case SECCOMP_RET_USER_NOTIF: > + if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION)) > + break; > default: > return -EOPNOTSUPP; > } > @@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, > #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread" > #define SECCOMP_RET_TRAP_NAME "trap" > #define SECCOMP_RET_ERRNO_NAME "errno" > +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif" > #define SECCOMP_RET_TRACE_NAME "trace" > #define SECCOMP_RET_LOG_NAME "log" > #define SECCOMP_RET_ALLOW_NAME "allow" > 
@@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] = > SECCOMP_RET_KILL_THREAD_NAME " " > SECCOMP_RET_TRAP_NAME " " > SECCOMP_RET_ERRNO_NAME " " > + SECCOMP_RET_USER_NOTIF_NAME " " > SECCOMP_RET_TRACE_NAME " " > SECCOMP_RET_LOG_NAME " " > SECCOMP_RET_ALLOW_NAME; > @@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = { > { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME }, > { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME }, > { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME }, > + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME }, > { } > }; > > @@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void) > device_initcall(seccomp_sysctl_init) > > #endif /* CONFIG_SYSCTL */ > + > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +static int seccomp_notify_release(struct inode *inode, struct file *file) > +{ > + struct seccomp_filter *filter = file->private_data; > + struct list_head *cur; > + > + mutex_lock(&filter->notify_lock); > + > + /* > + * If this file is being closed because e.g. the task who owned it > + * died, let's wake everyone up who was waiting on us. > + */ > + list_for_each(cur, &filter->notifications) { > + struct seccomp_knotif *knotif; > + > + knotif = list_entry(cur, struct seccomp_knotif, list); > + > + knotif->state = SECCOMP_NOTIFY_WRITE; > + knotif->error = -ENOSYS; > + knotif->val = 0; > + complete(&knotif->ready); > + } > + > + filter->has_listener = false; > + mutex_unlock(&filter->notify_lock); > + __put_seccomp_filter(filter); > + return 0; > +} > + > +static ssize_t seccomp_notify_read(struct file *f, char __user *buf, > + size_t size, loff_t *ppos) > +{ > + struct seccomp_filter *filter = f->private_data; > + struct seccomp_knotif *knotif = NULL; > + struct seccomp_notif unotif; > + struct list_head *cur; > + ssize_t ret; > + > + /* No offset reads. 
*/ > + if (*ppos != 0) > + return -EINVAL; > + > + ret = down_interruptible(&filter->request); > + if (ret < 0) > + return ret; > + > + mutex_lock(&filter->notify_lock); > + list_for_each(cur, &filter->notifications) { > + knotif = list_entry(cur, struct seccomp_knotif, list); > + if (knotif->state == SECCOMP_NOTIFY_INIT) > + break; > + } > + > + /* > + * We didn't find anything which is odd, because at least one > + * thing should have been queued. > + */ > + if (knotif->state != SECCOMP_NOTIFY_INIT) { > + ret = -ENOENT; > + WARN(1, "no seccomp notification found"); I tend to prefer WARN_ONCE, just in case this ever finds itself exposed to being triggered trivially from userspace. > + goto out; > + } > + > + unotif.id = knotif->id; > + unotif.pid = knotif->pid; > + unotif.data = *(knotif->data); > + > + size = min_t(size_t, size, sizeof(struct seccomp_notif)); > + if (copy_to_user(buf, &unotif, size)) { > + ret = -EFAULT; > + goto out; > + } > + > + ret = sizeof(unotif); > + knotif->state = SECCOMP_NOTIFY_READ; > + > +out: > + mutex_unlock(&filter->notify_lock); > + return ret; > +} > + > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, > + size_t size, loff_t *ppos) > +{ > + struct seccomp_filter *filter = file->private_data; > + struct seccomp_notif_resp resp = {}; > + struct seccomp_knotif *knotif = NULL; > + struct list_head *cur; > + ssize_t ret = -EINVAL; > + > + /* No partial writes. */ > + if (*ppos != 0) > + return -EINVAL; > + > + size = min_t(size_t, size, sizeof(resp)); In this case, we can't use min_t, size _must_ be == sizeof(resp), otherwise we're operating on what's in the stack (which is zeroed, but still). 
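
The strict-size point can be illustrated with a small userspace model.
This is only a sketch: struct sk_notif_resp is a hypothetical stand-in
for the struct seccomp_notif_resp proposed in the patch, sk_parse_resp()
is an invented name, and memcpy() stands in for copy_from_user(). The
idea it shows is rejecting any write whose length is not exactly one
full response, instead of clamping the length with min_t().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct seccomp_notif_resp from the patch. */
struct sk_notif_resp {
	uint32_t id;
	int error;
	long val;
};

/*
 * Accept a response only if the caller supplied exactly one full
 * structure; a short or long buffer is rejected and *out is untouched.
 */
int sk_parse_resp(const void *buf, size_t size, struct sk_notif_resp *out)
{
	assert(buf != NULL && out != NULL);
	if (size != sizeof(*out))
		return -EINVAL;

	memcpy(out, buf, sizeof(*out));
	return 0;
}
```

With this shape, a truncated write can never be acted on as if the
missing fields were zero, which is the concern with the min_t() version.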
> + if (copy_from_user(&resp, buf, size)) > + return -EFAULT; > + > + ret = mutex_lock_interruptible(&filter->notify_lock); > + if (ret < 0) > + return ret; > + > + list_for_each(cur, &filter->notifications) { > + knotif = list_entry(cur, struct seccomp_knotif, list); > + > + if (knotif->id == resp.id) > + break; So we're finding the matching id here. Now, I'm trying to think about how this will look in real-world use: the pid will be _blocked_ while this happening. And all the other pids that trip this filter will _also_ be blocked, since they're all waiting for the reader to read and respond. The risk is pid death while waiting, and having another appear with the same pid, trigger the same filter, get blocked, and then the reader replies for the old pid, and the new pid gets the results? Since this notification queue is already linear, can't we use ordering to enforce this? i.e. only the pid at the head of the filter notification queue is going to have anything happening to it. Or is the idea to have multiple readers/writers of the fd? > + } > + > + if (!knotif || knotif->id != resp.id) { > + ret = -EINVAL; > + goto out; > + } > + > + ret = size; > + knotif->state = SECCOMP_NOTIFY_WRITE; > + knotif->error = resp.error; > + knotif->val = resp.val; > + complete(&knotif->ready); > +out: > + mutex_unlock(&filter->notify_lock); > + return ret; > +} > + > +static const struct file_operations seccomp_notify_ops = { > + .read = seccomp_notify_read, > + .write = seccomp_notify_write, > + /* TODO: poll */ What's needed for poll? I think you've got all the pieces you need already, i.e. wait queue, notifications, etc. 
> + .release = seccomp_notify_release, > +}; > + > +static struct file *init_listener(struct seccomp_filter *filter) > +{ > + struct file *ret; > + > + mutex_lock(&filter->notify_lock); > + if (filter->has_listener) { > + mutex_unlock(&filter->notify_lock); > + return ERR_PTR(-EBUSY); > + } > + > + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, > + filter, O_RDWR); > + if (IS_ERR(ret)) { > + __put_seccomp_filter(filter); > + } else { > + /* > + * Intentionally don't put_seccomp_filter(). The file > + * has a reference to it now. > + */ > + filter->has_listener = true; > + } I spent some time staring at this, and I don't see it: where is the get_() for this? The caller of init_listener() already does a put() on the failure path. It seems like there is a get() missing near the start of init_listener(), or I've entirely missed something. (Regardless, I think the usage counting need a comment somewhere, maybe near the top of seccomp.c with the field?) > + > + mutex_unlock(&filter->notify_lock); > + return ret; > +} > +#endif > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c > index 24dbf634e2dd..b43e2a70b08c 100644 > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > @@ -40,6 +40,7 @@ > #include > #include > #include > +#include > > #define _GNU_SOURCE > #include > @@ -141,6 +142,24 @@ struct seccomp_data { > #define SECCOMP_FILTER_FLAG_LOG 2 > #endif > > +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER > +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 > + > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U > + > +struct seccomp_notif { > + __u32 id; > + pid_t pid; > + struct seccomp_data data; > +}; > + > +struct seccomp_notif_resp { > + __u32 id; > + int error; > + long val; > +}; > +#endif > + > #ifndef seccomp > int seccomp(unsigned int op, unsigned int flags, void *args) > { > @@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock) > 
TEST(detect_seccomp_filter_flags) > { > unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC, > - SECCOMP_FILTER_FLAG_LOG }; > + SECCOMP_FILTER_FLAG_LOG, > + SECCOMP_FILTER_FLAG_GET_LISTENER }; > unsigned int flag, all_flags; > int i; > long ret; > @@ -2845,6 +2865,98 @@ TEST(get_action_avail) > EXPECT_EQ(errno, EOPNOTSUPP); > } > > +static int user_trap_syscall(int nr, unsigned int flags) > +{ > + struct sock_filter filter[] = { > + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, > + offsetof(struct seccomp_data, nr)), > + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1), > + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF), > + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), > + }; > + > + struct sock_fprog prog = { > + .len = (unsigned short)ARRAY_SIZE(filter), > + .filter = filter, > + }; > + > + return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog); > +} > + > +#define USER_NOTIF_MAGIC 116983961184613L Is this just you mashing the numpad? :) > +TEST(get_user_notification_syscall) > +{ > + pid_t pid; > + long ret; > + int status, listener; > + struct seccomp_notif req; > + struct seccomp_notif_resp resp; > + > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + /* Check that we get -ENOSYS with no listener attached */ > + if (pid == 0) { > + ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0); > + ret = syscall(__NR_getpid); > + exit(ret >= 0 || errno != ENOSYS); > + } > + > + ASSERT_EQ(waitpid(pid, &status, 0), pid); > + ASSERT_EQ(true, WIFEXITED(status)); > + ASSERT_EQ(0, WEXITSTATUS(status)); > + > + /* Check that the basic notification machinery works */ > + listener = user_trap_syscall(__NR_getpid, > + SECCOMP_FILTER_FLAG_GET_LISTENER); > + ASSERT_GE(listener, 0); > + > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + ret = syscall(__NR_getpid); > + exit(ret != USER_NOTIF_MAGIC); > + } > + > + ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); > + > + resp.id = req.id; > + resp.error = 0; > + resp.val = USER_NOTIF_MAGIC; > + > + ASSERT_EQ(write(listener, &resp, 
sizeof(resp)), sizeof(resp)); > + > + ASSERT_EQ(waitpid(pid, &status, 0), pid); > + ASSERT_EQ(true, WIFEXITED(status)); > + ASSERT_EQ(0, WEXITSTATUS(status)); > + > + /* > + * Check that nothing bad happens when we kill the task in the middle > + * of a syscall. > + */ > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + ret = syscall(__NR_getpid); > + exit(ret != USER_NOTIF_MAGIC); > + } > + > + ret = read(listener, &req, sizeof(req)); > + ASSERT_EQ(ret, sizeof(req)); > + > + ASSERT_EQ(kill(pid, SIGKILL), 0); > + ASSERT_EQ(waitpid(pid, NULL, 0), pid); > + > + resp.id = req.id; > + ret = write(listener, &resp, sizeof(resp)); > + EXPECT_EQ(ret, -1); > + EXPECT_EQ(errno, EINVAL); > + > + close(listener); > +} Yay selftests! :) -Kees > + > /* > * TODO: > * - add microbenchmarks > -- > 2.14.1 > -- Kees Cook Pixel Security From keescook at chromium.org Tue Feb 13 21:29:23 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 13:29:23 -0800 Subject: [RFC 2/3] seccomp: hoist out filter resolving logic In-Reply-To: <20180204104946.25559-3-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-3-tycho@tycho.ws> Message-ID: On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > Hoist out the nth filter resolving logic that ptrace uses into a new > function. We'll use this in the next patch to implement the new > PTRACE_SECCOMP_GET_FILTER_FLAGS command. This is based on an older patch > that I had sent a while ago; it significantly revamps the get_nth_filter > logic based on previous suggestions from Oleg. Is this the same as f06eae831f0c1fc5b982ea200daf552810e1dd55 ? Quick compare says yes? Either way, please rebase to v4.16-rc1 (or -rc2 in the future). 
:) -Kees -- Kees Cook Pixel Security From keescook at chromium.org Tue Feb 13 21:32:26 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 13 Feb 2018 13:32:26 -0800 Subject: [RFC 3/3] seccomp: add a way to get a listener fd from ptrace In-Reply-To: <20180204104946.25559-4-tycho@tycho.ws> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-4-tycho@tycho.ws> Message-ID: On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() > version which can acquire filters is useful. There are at least two reasons > this is preferable, even though it uses ptrace: > > 1. You can control tasks that aren't cooperating with you > 2. You can control tasks whose filters block sendmsg() and socket(); if the > task installs a filter which blocks these calls, there's no way with > SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. I got worried for a second that this would get us into a many-to-many state, but I see init_listener enforces a single listener per filter. Whew. Seems legit. :) -Kees -- Kees Cook Pixel Security From lkml at metux.net Tue Feb 13 22:19:48 2018 From: lkml at metux.net (Enrico Weigelt) Date: Tue, 13 Feb 2018 22:19:48 +0000 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> Message-ID: <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> On 13.02.2018 22:12, Enrico Weigelt wrote: CC @containers at lists.linux-foundation.org > Hi folks, > > > I'm currently trying to implement plan9 semantics on Linux and > yet sorting out how to do the mount namespace handling. > > On plan9, any unprivileged process can create its own namespace > and mount/bind at will, while on Linux this requires CAP_SYS_ADMIN. > > What is the reason for not allowing arbitrary users to create their > own private mount namespace ? 
What could go wrong here ?
>
> IMHO, we could allow mount/bind under the following conditions:
>
> * the process is in a private mount namespace
> * no suid-flag is honored (either force all mounts to nosuid or
>   completely mask it out)
> * only certain whitelisted filesystems allowed (eg. 9P and FUSE)
>
> Maybe that all could be enabled by a new capability.
>
>
> any suggestions ?
>
>
> --mtx
>

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From asarai at suse.de Tue Feb 13 22:27:51 2018
From: asarai at suse.de (Aleksa Sarai)
Date: Wed, 14 Feb 2018 09:27:51 +1100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net>
Message-ID: <20180213222751.p3fyg7whg6jqlzz5@gordon>

On 2018-02-13, Enrico Weigelt wrote:
> On 13.02.2018 22:12, Enrico Weigelt wrote:
> > I'm currently trying to implement plan9 semantics on Linux and
> > yet sorting out how to do the mount namespace handling.
> >
> > On plan9, any unprivileged process can create its own namespace
> > and mount/bind at will, while on Linux this requires CAP_SYS_ADMIN.
> >
> > What is the reason for not allowing arbitrary users to create their
> > own private mount namespace ? What could go wrong here ?

You can do this by creating a new user namespace (CLONE_NEWUSER), which
then gives you the required permissions to create other namespaces
(CLONE_NEWNS). This is how "rootless containers" or unprivileged
containers operate.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From lkml at metux.net Wed Feb 14 00:01:49 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 00:01:49 +0000
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <20180213222751.p3fyg7whg6jqlzz5@gordon>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon>
Message-ID: <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net>

On 13.02.2018 22:27, Aleksa Sarai wrote:

> You can do this by creating a new user namespace (CLONE_NEWUSER), which
> then gives you the required permissions to create other namespaces
> (CLONE_NEWNS). This is how "rootless containers" or unprivileged
> containers operate.

hmm, unshare -U doesn't work for me (even as root). But docker works,
so user namespaces should be working. Any idea what could be wrong ?

--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From mic at digikod.net Wed Feb 14 00:47:10 2018
From: mic at digikod.net (Mickaël Salaün)
Date: Wed, 14 Feb 2018 01:47:10 +0100
Subject: [PATCH net-next 0/3] eBPF Seccomp filters
In-Reply-To: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal>
References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal>
Message-ID: <6b930450-37f8-ee3d-fab5-503ee50fa50b@digikod.net>

seccomp-bpf does not use cBPF but a subset of it. The reason is that it
is meant to reduce the attack surface of the kernel. By limiting the
number of instructions allowed by seccomp-bpf, it really reduces the
possibilities for an attacker to use seccomp-bpf as an entry point to
attack the kernel. Moreover, this subset of cBPF is just fine to filter
simple things such as syscall numbers and arguments. Additional return
codes may be added to extend seccomp features.
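
For illustration only (this is not part of the original message): a
minimal sketch of the kind of "simple" classic-BPF seccomp program
referred to above, one that matches a single syscall number. The helper
name build_deny_one_filter() is hypothetical. A real filter should also
check seccomp_data.arch before trusting the syscall number, which is
omitted here to keep the sketch short.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fill `out` with a four-instruction classic BPF
 * program that fails syscall number `nr` with EPERM and allows every
 * other syscall. Returns the number of instructions written.
 */
static unsigned short build_deny_one_filter(int nr, struct sock_filter out[4])
{
	const struct sock_filter prog[] = {
		/* Load the syscall number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Equal to nr: fall through. Otherwise skip one instruction. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
		/* Matched: make the syscall fail with EPERM. */
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		/* Not matched: let the syscall through. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	unsigned short i, n = sizeof(prog) / sizeof(prog[0]);

	for (i = 0; i < n; i++)
		out[i] = prog[i];
	return n;
}
```

The resulting array would be wrapped in a struct sock_fprog and passed to
seccomp(SECCOMP_SET_MODE_FILTER, ...) after setting PR_SET_NO_NEW_PRIVS;
that installation step is left out of the sketch.
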
FYI, I'm tweaking a new version of Landlock, which is not an extension
of seccomp-bpf (as it was at first) but a standalone LSM leveraging eBPF
to create security sandboxes (which seccomp-bpf does not do). I'll send
this version soon but you can get a sneak peek here (the documentation
will come with the final version):
https://github.com/landlock-lsm/linux/commit/6c9131a5ccdf7aa599999b23f3a9ae2b73008f41
(please do not comment on this code yet)

I think the current seccomp-bpf bytecode is excellent for what it is
meant to do. Landlock leverages eBPF to tackle a more complex problem
(e.g. controlling access to files, and much more). It is not a seccomp
replacement but a complementary layer of security.

About the verbosity of seccomp filters, you may want to try other ways
to write policies (e.g.
https://github.com/google/kafel/ or
https://android.googlesource.com/platform/external/minijail/+/master/tools/generate_seccomp_policy.py
or
https://github.com/servo/gaol/blob/master/platform/linux/seccomp.rs).

Regards,
 Mickaël


On 13/02/2018 16:42, Sargun Dhillon wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
>
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
>
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS.
Nonetheless, if we come > to an agreement, this can be in a follow-up patchset. > > > Sargun Dhillon (3): > bpf, seccomp: Add eBPF filter capabilities > seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp > filters > bpf: Add eBPF seccomp sample programs > > arch/Kconfig | 7 ++ > include/linux/bpf_types.h | 3 + > include/linux/seccomp.h | 12 +++ > include/uapi/linux/bpf.h | 2 + > include/uapi/linux/ptrace.h | 5 +- > include/uapi/linux/seccomp.h | 15 ++-- > kernel/bpf/syscall.c | 1 + > kernel/ptrace.c | 3 + > kernel/seccomp.c | 185 ++++++++++++++++++++++++++++++++++++++----- > samples/bpf/Makefile | 9 +++ > samples/bpf/bpf_load.c | 9 ++- > samples/bpf/seccomp1_kern.c | 17 ++++ > samples/bpf/seccomp1_user.c | 34 ++++++++ > samples/bpf/seccomp2_kern.c | 24 ++++++ > samples/bpf/seccomp2_user.c | 66 +++++++++++++++ > 15 files changed, 362 insertions(+), 30 deletions(-) > create mode 100644 samples/bpf/seccomp1_kern.c > create mode 100644 samples/bpf/seccomp1_user.c > create mode 100644 samples/bpf/seccomp2_kern.c > create mode 100644 samples/bpf/seccomp2_user.c > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL: From asarai at suse.de Wed Feb 14 04:54:42 2018 From: asarai at suse.de (Aleksa Sarai) Date: Wed, 14 Feb 2018 15:54:42 +1100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> Message-ID: <20180214045442.jyv6zpbwz5glzi4z@gordon> On 2018-02-14, Enrico Weigelt wrote: > On 13.02.2018 22:27, Aleksa Sarai wrote: > > > You can do this by creating a new user namespace (CLONE_NEWUSER), which > > then gives you the required permissions to create other namespaces > > (CLONE_NEWNS). This is how "rootless containers" or unprivileged > > containers operate. > > hmm, unshare -U doesn't work for me (even as root). But docker works, > so user namespaces should be working. Any idea what could be wrong ? It depends how old your kernel is and what distro you use. Arch Linux disables user namespaces entirely, Debian requires that you set a sysctl to enable unprivileged user namespaces, and RHEL requires you to set both a sysctl and a kernel boot-flag. Also check how old your kernel is (unprivileged user namespace support was added in 3.8). Also Docker doesn't use user namespaces by default (you need to manually enable it with --userns-remap, check the docs for more details). You probably also want to be using "unshare -r" in your testing (as "unshare -U" will leave you without mapped users). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From lkml at metux.net Wed Feb 14 10:18:13 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 10:18:13 +0000
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <20180214045442.jyv6zpbwz5glzi4z@gordon>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> <20180214045442.jyv6zpbwz5glzi4z@gordon>
Message-ID: <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net>

On 14.02.2018 04:54, Aleksa Sarai wrote:

> It depends how old your kernel is and what distro you use. Arch Linux
> disables user namespaces entirely, Debian requires that you set a sysctl
> to enable unprivileged user namespaces, and RHEL requires you to set
> both a sysctl and a kernel boot-flag. Also check how old your kernel is
> (unprivileged user namespace support was added in 3.8).

Just tried on a mainline kernel (4.15). Same problem:

root@alphabox:~ unshare -U -r
unshare: unshare(0x14000000): Invalid argument
root@alphabox:/proc/sys/user cat max_user_namespaces
5922

Am I missing something ?
--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From asarai at suse.de Wed Feb 14 10:24:10 2018
From: asarai at suse.de (Aleksa Sarai)
Date: Wed, 14 Feb 2018 21:24:10 +1100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> <20180214045442.jyv6zpbwz5glzi4z@gordon> <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net>
Message-ID: <20180214102410.dxgbayb4i76h5exo@gordon>

On 2018-02-14, Enrico Weigelt wrote:
> On 14.02.2018 04:54, Aleksa Sarai wrote:
>
> > It depends how old your kernel is and what distro you use. Arch Linux
> > disables user namespaces entirely, Debian requires that you set a sysctl
> > to enable unprivileged user namespaces, and RHEL requires you to set
> > both a sysctl and a kernel boot-flag. Also check how old your kernel is
> > (unprivileged user namespace support was added in 3.8).
>
> Just tried on a mainline kernel (4.15). Same problem:
>
> root@alphabox:~ unshare -U -r
> unshare: unshare(0x14000000): Invalid argument
> root@alphabox:/proc/sys/user cat max_user_namespaces
> 5922

What distribution are you using and which release? Also, are you trying
to do this inside a Docker container or something similar (Docker has
seccomp filters that block CLONE_NEWUSER by default, for instance).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From lkml at metux.net Wed Feb 14 11:27:44 2018 From: lkml at metux.net (Enrico Weigelt) Date: Wed, 14 Feb 2018 12:27:44 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <20180214102410.dxgbayb4i76h5exo@gordon> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> <20180214045442.jyv6zpbwz5glzi4z@gordon> <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net> <20180214102410.dxgbayb4i76h5exo@gordon> Message-ID: <24ddea73-5c84-e098-caae-8a4c14834cbd@metux.net> On 14.02.2018 11:24, Aleksa Sarai wrote: > What distribution are you using and which release? On a self-compiled system. Forgot to enable namespaces in the kernel. Now it seems to work as root, but not as an unprivileged user: daemon at alphabox:~ unshare -r -U unshare: can't open '/proc/self/setgroups': Permission denied daemon at alphabox:~ unshare -f -r -U unshare: can't open '/proc/self/setgroups': Permission denied --mtx -- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info at metux.net -- +49-151-27565287 From richard.weinberger at gmail.com Wed Feb 14 11:30:18 2018 From: richard.weinberger at gmail.com (Richard Weinberger) Date: Wed, 14 Feb 2018 12:30:18 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <24ddea73-5c84-e098-caae-8a4c14834cbd@metux.net> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> <20180214045442.jyv6zpbwz5glzi4z@gordon> <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net> <20180214102410.dxgbayb4i76h5exo@gordon> <24ddea73-5c84-e098-caae-8a4c14834cbd@metux.net> Message-ID: On Wed, Feb 14, 2018 at 12:27 PM, Enrico Weigelt wrote: > On 
14.02.2018 11:24, Aleksa Sarai wrote: > >> What distribution are you using and which release? > > > On a self-compiled system. > > Forgot to enable namespaces in the kernel. Now it seems to work > as root, but not as an unprivileged user: > > > daemon at alphabox:~ unshare -r -U > unshare: can't open '/proc/self/setgroups': Permission denied > daemon at alphabox:~ unshare -f -r -U > unshare: can't open '/proc/self/setgroups': Permission denied > Please read http://man7.org/linux/man-pages/man7/user_namespaces.7.html setgroups is a corner case and needs special care. -- Thanks, //richard From mszeredi at redhat.com Wed Feb 14 12:28:12 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Wed, 14 Feb 2018 13:28:12 +0100 Subject: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems In-Reply-To: <61a37f0b159dd56825696d8d3beb8eaffdf1f72f.1512041070.git.dongsu@kinvolk.io> References: <61a37f0b159dd56825696d8d3beb8eaffdf1f72f.1512041070.git.dongsu@kinvolk.io> Message-ID: On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > From: Seth Forshee > > The user in control of a super block should be allowed to freeze > and thaw it. Relax the restrictions on the FIFREEZE and FITHAW > ioctls to require CAP_SYS_ADMIN in s_user_ns. Why is this required for unprivileged fuse? Fuse doesn't support freeze, so this seems to make no sense in the context of this patchset. 
Thanks, Miklos From lkml at metux.net Wed Feb 14 12:38:48 2018 From: lkml at metux.net (Enrico Weigelt) Date: Wed, 14 Feb 2018 13:38:48 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> <20180213222751.p3fyg7whg6jqlzz5@gordon> <39b08c53-3449-3164-c1b1-44ac587dd4ea@metux.net> <20180214045442.jyv6zpbwz5glzi4z@gordon> <9c097fd9-3035-d5be-a829-fc18e7734f18@metux.net> <20180214102410.dxgbayb4i76h5exo@gordon> <24ddea73-5c84-e098-caae-8a4c14834cbd@metux.net> Message-ID: <4864d279-9a3f-eaf4-c297-ea34be604e41@metux.net> On 14.02.2018 12:30, Richard Weinberger wrote: > On Wed, Feb 14, 2018 at 12:27 PM, Enrico Weigelt wrote: >> On 14.02.2018 11:24, Aleksa Sarai wrote: >> >>> What distribution are you using and which release? >> >> >> On a self-compiled system. >> >> Forgot to enable namespaces in the kernel. Now it seems to work >> as root, but not as an unprivileged user: >> >> >> daemon at alphabox:~ unshare -r -U >> unshare: can't open '/proc/self/setgroups': Permission denied >> daemon at alphabox:~ unshare -f -r -U >> unshare: can't open '/proc/self/setgroups': Permission denied >> > > Please read http://man7.org/linux/man-pages/man7/user_namespaces.7.html > setgroups is a corner case and needs special care. I'm still confused. Does the unshare program do something wrong here ? Anyways, I doubt that user namespaces help solving my problem. What I'd like to achieve is that processes can manipulate their private namespace at will and mount other filesystems (primarily 9p and fuse). For that, I need to get rid of setuid (and per-file caps) for these private namespaces. 
--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From richard at sigma-star.at Wed Feb 14 12:53:40 2018
From: richard at sigma-star.at (Richard Weinberger)
Date: Wed, 14 Feb 2018 13:53:40 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <4864d279-9a3f-eaf4-c297-ea34be604e41@metux.net>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <4864d279-9a3f-eaf4-c297-ea34be604e41@metux.net>
Message-ID: <2658681.ustYaP9yci@blindfold>

Enrico,

On Wednesday, 14 February 2018, 13:38:48 CET, Enrico Weigelt wrote:
> On 14.02.2018 12:30, Richard Weinberger wrote:
> > On Wed, Feb 14, 2018 at 12:27 PM, Enrico Weigelt wrote:
> >> On 14.02.2018 11:24, Aleksa Sarai wrote:
> >>> What distribution are you using and which release?
> >>
> >> On a self-compiled system.
> >>
> >> Forgot to enable namespaces in the kernel. Now it seems to work
> >> as root, but not as an unprivileged user:
> >>
> >>
> >> daemon@alphabox:~ unshare -r -U
> >> unshare: can't open '/proc/self/setgroups': Permission denied
> >> daemon@alphabox:~ unshare -f -r -U
> >> unshare: can't open '/proc/self/setgroups': Permission denied
> >
> > Please read http://man7.org/linux/man-pages/man7/user_namespaces.7.html
> > setgroups is a corner case and needs special care.
>
> I'm still confused. Does the unshare program do something wrong here ?

It does what you ask it to.
Also see the --setgroups switch.
AFAICT --setgroups=deny is the new default, so your command line should just
work. Maybe your unshare tool is too old.

> Anyways, I doubt that user namespaces help solving my problem.
>
> What I'd like to achieve is that processes can manipulate their private
> namespace at will and mount other filesystems (primarily 9p and fuse).
>
> For that, I need to get rid of setuid (and per-file caps) for these
> private namespaces.

This is exactly why we have the user namespace.
In the user namespace you can create your own mount namespace and do (almost) whatever you want. Please note that you cannot mount any kind of filesystem. For FUSE, see https://lwn.net/Articles/684774/ Thanks, //richard -- sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria ATU66964118 - FN 374287y From mszeredi at redhat.com Wed Feb 14 13:44:34 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Wed, 14 Feb 2018 14:44:34 +0100 Subject: [PATCH 10/11] fuse: Allow user namespace mounts In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > From: Seth Forshee > > To be able to mount fuse from non-init user namespaces, it's necessary > to set FS_USERNS_MOUNT flag to fs_flags. > > Patch v4 is available: https://patchwork.kernel.org/patch/8944681/ > > Cc: linux-fsdevel at vger.kernel.org > Cc: linux-kernel at vger.kernel.org > Cc: Miklos Szeredi > Signed-off-by: Seth Forshee > [dongsu: add a simple commit messasge] > Signed-off-by: Dongsu Park > --- > fs/fuse/inode.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 7f6b2e55..8c98edee 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb) > static struct file_system_type fuse_fs_type = { > .owner = THIS_MODULE, > .name = "fuse", > - .fs_flags = FS_HAS_SUBTYPE, > + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT, > .mount = fuse_mount, > .kill_sb = fuse_kill_sb_anon, > }; I think enabling FS_USERNS_MOUNT should be pretty safe. I was thinking opting out should be as simple as "chmod o-rw /dev/fuse". But that breaks libfuse, even though fusermount opens /dev/fuse in privileged mode, so it shouldn't. That can be fixed in libfuse, but it's an unfortunate bug and it also means /dev/fuse is configured with "crw-rw-rw-" in most cases. Which means it will be opting out, not opting in, which is the less safe version. 
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>  	.name = "fuseblk",
>  	.mount = fuse_mount_blk,
>  	.kill_sb = fuse_kill_sb_blk,
> -	.fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +	.fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");

As I said, this hunk should be dropped from the first version, because
it's possibly unsafe.

Thanks,
Miklos

From lkml at metux.net Wed Feb 14 14:03:55 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 15:03:55 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <2658681.ustYaP9yci@blindfold>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <4864d279-9a3f-eaf4-c297-ea34be604e41@metux.net> <2658681.ustYaP9yci@blindfold>
Message-ID:

On 14.02.2018 13:53, Richard Weinberger wrote:

> It does what you ask it for.
> Also see the --setgroups switch.
> AFAICT --setgroups=deny is the new default, then your command line should just
> work. Maybe your unshare tool is too old.

Also doesn't help:

daemon@alphabox:~ unshare -U -r --setgroups=deny
unshare: can't open '/proc/self/setgroups': Permission denied

>> What I'd like to achieve is that processes can manipulate their private
>> namespace at will and mount other filesystems (primarily 9p and fuse).
>>
>> For that, I need to get rid of setuid (and per-file caps) for these
>> private namespaces.
>
> This is exactly why we have the user namespace.
> In the user namespace you can create your own mount namespace and do (almost)
> whatever you want.

What's the exact relation between user and mnt namespace ?
Why do I need an own user ns for private mnt ns ? (except for the suid
bit, which I wanna get rid of anyways).
--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From richard at sigma-star.at Wed Feb 14 14:19:24 2018
From: richard at sigma-star.at (Richard Weinberger)
Date: Wed, 14 Feb 2018 15:19:24 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To:
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <2658681.ustYaP9yci@blindfold>
Message-ID: <2050418.Dl5pXkWGsk@blindfold>

On Wednesday, 14 February 2018, 15:03:55 CET, Enrico Weigelt wrote:
> On 14.02.2018 13:53, Richard Weinberger wrote:
> > It does what you ask it for.
> > Also see the --setgroups switch.
> > AFAICT --setgroups=deny is the new default, then your command line
> > should just work. Maybe your unshare tool is too old.
>
> Also doesn't help:
>
> daemon@alphabox:~ unshare -U -r --setgroups=deny
> unshare: can't open '/proc/self/setgroups': Permission denied

Works here(tm).
Can you debug it? Maybe we miss something obvious.

> >> What I'd like to achieve is that processes can manipulate their private
> >> namespace at will and mount other filesystems (primarily 9p and fuse).
> >>
> >> For that, I need to get rid of setuid (and per-file caps) for these
> >> private namespaces.
> >
> > This is exactly why we have the user namespace.
> > In the user namespace you can create your own mount namespace and do
> > (almost) whatever you want.
>
> What's the exact relation between user and mnt namespace ?
> Why do I need an own user ns for private mnt ns ? (except for the suid
> bit, which I wanna get rid of anyways).

Mount-related system calls are root-only. Therefore you need the user
namespace to become root in your own little world.
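
For illustration only (not part of the original message): a minimal C
sketch of the sequence described in this thread, i.e. create a user
namespace, map the caller to uid/gid 0 inside it, then create a private
mount namespace. The helper names format_id_map(), write_file() and
enter_private_mount_ns() are hypothetical. The map line format
"<id-inside> <id-outside> <count>" and the need to write "deny" to
/proc/self/setgroups before gid_map are taken from user_namespaces(7).
Whether this succeeds depends on kernel configuration and distro policy,
as discussed above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one id-map line: "<id-inside-ns> <id-outside-ns> <count>". */
static int format_id_map(char *buf, size_t len, unsigned inside,
			 unsigned outside, unsigned count)
{
	return snprintf(buf, len, "%u %u %u", inside, outside, count);
}

/* Write a short string to a /proc file; returns 0 on success, -1 on error. */
static int write_file(const char *path, const char *text)
{
	size_t len = strlen(text);
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, text, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}

/* Become uid/gid 0 in a new user namespace, then unshare the mount namespace. */
static int enter_private_mount_ns(void)
{
	char map[64];
	/* Read the ids before unshare(); afterwards they are unmapped. */
	uid_t uid = getuid();
	gid_t gid = getgid();

	if (unshare(CLONE_NEWUSER) < 0)
		return -1;
	/* An unprivileged writer must deny setgroups before writing gid_map. */
	if (write_file("/proc/self/setgroups", "deny") < 0)
		return -1;
	format_id_map(map, sizeof(map), 0, (unsigned)uid, 1);
	if (write_file("/proc/self/uid_map", map) < 0)
		return -1;
	format_id_map(map, sizeof(map), 0, (unsigned)gid, 1);
	if (write_file("/proc/self/gid_map", map) < 0)
		return -1;
	/* Now "root" inside the namespace: a private mount namespace is allowed. */
	return unshare(CLONE_NEWNS);
}
```

A caller would invoke enter_private_mount_ns() once, early in a
single-threaded process, and perform its mount(2) calls afterwards; which
filesystem types may be mounted from there is still restricted by the
kernel.
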
:) Thanks, //richard -- sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria ATU66964118 - FN 374287y From lkml at metux.net Wed Feb 14 15:02:18 2018 From: lkml at metux.net (Enrico Weigelt) Date: Wed, 14 Feb 2018 16:02:18 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <2050418.Dl5pXkWGsk@blindfold> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <2658681.ustYaP9yci@blindfold> <2050418.Dl5pXkWGsk@blindfold> Message-ID: <4f620eb7-c00c-487b-2e06-8cc4c97af38c@metux.net> On 14.02.2018 15:19, Richard Weinberger wrote: > Works here(tm). > Can you debug it? Maybe we miss something obvious. daemon at alphabox:~ strace unshare -U -r --setgroups=deny execve("/bin/unshare", ["unshare", "-U", "-r", "--setgroups=deny"], 0x7ee51e0c /* 11 vars */) = 0 brk(NULL) = 0x58000 fcntl64(0, F_GETFD) = 0 fcntl64(1, F_GETFD) = 0 fcntl64(2, F_GETFD) = 0 access("/etc/suid-debug", F_OK) = -1 ENOENT (No such file or directory) uname({sysname="Linux", nodename="alphabox", ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f90000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/lib/tls/v7l/neon/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/v7l/neon/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/v7l/neon/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/v7l/neon", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/v7l/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/v7l/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/v7l/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/v7l", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/neon/vfp/libc.so.6", 
O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/neon/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/neon/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/neon", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/tls/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/tls", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/v7l/neon/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/v7l/neon/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/v7l/neon/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/v7l/neon", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/v7l/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/v7l/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/v7l/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/v7l", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/neon/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/neon/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/neon/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/neon", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/vfp/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat64("/lib/vfp", 0x7eae8710) = -1 ENOENT (No such file or directory) open("/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0Yi\1\0004\0\0\0"..., 512) = 512 fstat64(3, {st_mode=S_IFREG|0755, st_size=878136, ...}) = 0 mmap2(NULL, 947496, 
PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x76e82000 mprotect(0x76f55000, 61440, PROT_NONE) = 0 mmap2(0x76f64000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd2000) = 0x76f64000 mmap2(0x76f67000, 9512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x76f67000 close(3) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f8f000 set_tls(0x76f8f4c0, 0x76f8fb98, 0x76f92050, 0x76f8f4c0, 0x76f92050) = 0 mprotect(0x76f64000, 8192, PROT_READ) = 0 mprotect(0x76f91000, 4096, PROT_READ) = 0 getuid32() = 1 stat64("/etc/busybox.conf", {st_mode=S_IFREG|0644, st_size=198, ...}) = 0 brk(NULL) = 0x58000 brk(0x79000) = 0x79000 open("/etc/busybox.conf", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=198, ...}) = 0 read(3, "[SUID]\n#lines starting with # ar"..., 1024) = 198 read(3, "", 1024) = 0 close(3) = 0 getgid32() = 1 setgid32(1) = 0 setuid32(1) = 0 geteuid32() = 1 getegid32() = 1 unshare(CLONE_NEWUTS|CLONE_NEWUSER) = 0 open("/proc/self/setgroups", O_WRONLY|O_LARGEFILE) = 3 write(3, "deny", 4) = 4 close(3) = 0 open("/proc/self/uid_map", O_WRONLY|O_LARGEFILE) = 3 write(3, "1 0 1", 5) = -1 EPERM (Operation not permitted) write(2, "unshare: write error: Operation "..., 46unshare: write error: Operation not permitted ) = 46 exit_group(1) = ? +++ exited with 1 +++ Seems it fails to write the uid map. Is the order of setgroups vs uid_map correct ? >> What's the exact relation between user and mnt namespace ? >> Why do I need an own user ns for private mnt ns ? (except for the suid >> bit, which I wanna get rid of anyways). > > mount related system calls are root-only. Therefore you need the user > namespace to become a root in your own little world. :) I'm looking for a way to do that w/o being root (or something similar). 
Actually, I don't like to change the user namespace, as it would cause
a lot of trouble w/ the /dev/cap[hash|use] devices, which I'm using for
user switching (as said: I'm going to get rid of suid completely).

--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From richard at sigma-star.at Wed Feb 14 15:17:35 2018
From: richard at sigma-star.at (Richard Weinberger)
Date: Wed, 14 Feb 2018 16:17:35 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <4f620eb7-c00c-487b-2e06-8cc4c97af38c@metux.net>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net>
 <2050418.Dl5pXkWGsk@blindfold>
 <4f620eb7-c00c-487b-2e06-8cc4c97af38c@metux.net>
Message-ID: <60748622.exvCVAzLTp@blindfold>

Enrico,

On Wednesday, 14 February 2018, 16:02:18 CET, Enrico Weigelt wrote:
> stat64("/etc/busybox.conf", {st_mode=S_IFREG|0644, st_size=198, ...}) = 0

busybox...

> brk(NULL) = 0x58000
> brk(0x79000) = 0x79000
> open("/etc/busybox.conf", O_RDONLY|O_LARGEFILE) = 3
> fstat64(3, {st_mode=S_IFREG|0644, st_size=198, ...}) = 0
> read(3, "[SUID]\n#lines starting with # ar"..., 1024) = 198
> read(3, "", 1024) = 0
> close(3) = 0
> getgid32() = 1
> setgid32(1) = 0
> setuid32(1) = 0
> geteuid32() = 1
> getegid32() = 1
> unshare(CLONE_NEWUTS|CLONE_NEWUSER) = 0
> open("/proc/self/setgroups", O_WRONLY|O_LARGEFILE) = 3
> write(3, "deny", 4) = 4
> close(3) = 0
> open("/proc/self/uid_map", O_WRONLY|O_LARGEFILE) = 3
> write(3, "1 0 1", 5) = -1 EPERM (Operation not permitted)

This mapping looks broken. Please report to busybox folks.
From taking a *very* quick look into busybox source, I suspect this
should fix it:

diff --git a/util-linux/unshare.c b/util-linux/unshare.c
index 875e3f86e304..3f59cf4d27c2 100644
--- a/util-linux/unshare.c
+++ b/util-linux/unshare.c
@@ -350,9 +350,9 @@ int unshare_main(int argc UNUSED_PARAM, char **argv)
                 * in that user namespace.
*/ xopen_xwrite_close(PATH_PROC_SETGROUPS, "deny"); - sprintf(uidmap_buf, "%u 0 1", (unsigned)reuid); + sprintf(uidmap_buf, "0 %u 1", (unsigned)reuid); xopen_xwrite_close(PATH_PROC_UIDMAP, uidmap_buf); - sprintf(uidmap_buf, "%u 0 1", (unsigned)regid); + sprintf(uidmap_buf, "0 %u 1", (unsigned)regid); xopen_xwrite_close(PATH_PROC_GIDMAP, uidmap_buf); } else if (setgrp_str) { Thanks, //richard -- sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria ATU66964118 - FN 374287y From tycho at tycho.ws Wed Feb 14 15:29:58 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Wed, 14 Feb 2018 08:29:58 -0700 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> Message-ID: <20180214152958.cjgwh2k52zji2jxk@cisco> Hey Kees, Thanks for taking a look! On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: > On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > > This patch introduces a means for syscalls matched in seccomp to notify > > some other task that a particular filter has been triggered. > > > > The motivation for this is primarily for use with containers. For example, > > if a container does an init_module(), we obviously don't want to load this > > untrusted code, which may be compiled for the wrong version of the kernel > > anyway. Instead, we could parse the module image, figure out which module > > the container is trying to load and load it on the host. > > > > As another example, containers cannot mknod(), since this checks > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > > coding some whitelist in the kernel. Another example is mount(), which has > > many security restrictions for good reason, but configuration or runtime > > knowledge could potentially be used to relax these restrictions. 
> > Related to the eBPF seccomp thread, can the logic for these things be > handled entirely by eBPF? My assumption is that you still need to stop > the process to do something (i.e. do a mknod, or a mount) before > letting it continue. Is there some "wait for notification" system in > eBPF? I replied in the other thread (https://patchwork.ozlabs.org/cover/872938/#1856642 for those following along at home), but no, at least not that I know of. > > The actual implementation of this is fairly small, although getting the > > synchronization right was/is slightly complex. Also worth noting that there > > is one race still present: > > > > 1. a task does a SECCOMP_RET_USER_NOTIF > > 2. the userspace handler reads this notification > > 3. the task dies > > 4. a new task with the same pid starts > > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > > that the previous one did > > 6. the userspace handler writes a response > > > > There's no way to distinguish this case right now. Maybe we care, maybe we > > don't, but it's worth noting. > > So, I'd like to avoid the cookie if possible (surprise). Why isn't it > possible to close the kernel-side of the fd to indicate that it lost > the pid it was attached to? Because the fd is for a filter, not a task. > Is this just that the reader has no idea > who is sending messages? So the risk is a fork/die loop within the > same process tree (i.e. attached to the same filter)? Hrmpf. I can't > think of a better way to handle the > one(fd)-to-many(task-with-that-filter-attached) situation... Yes, exactly. The cookie just adds uniqueness, and as Andy pointed out if we switch to u64, the race above basically ("u64 should be enough for anybody") goes away. > > Right now the interface is a simple structure copy across a file > > descriptor. We could potentially invent something fancier. > > I wonder if this communication should be netlink, which gives a more > well-structured way to describe what's on the wire? 
The reason I ask > is because if we ever change the seccomp_data structure, we'll now > have two places where we need to deal with it (the first being within > the BPF itself). My initial idea was to prefix the communication with > a size field, then send the structure, and then I had nightmares, and > realized this was basically netlink reinvented. I suggested netlink in LA, and everyone (especially Andy) groaned very loudly :). I'm happy to switch it to netlink if you like, although i think memcpy() of structs should be safe here, since the return value from read or write can indicate the size of things. > > Finally, it's worth noting that the classic seccomp TOCTOU of reading > > memory data from the task still applies here, but can be avoided with > > careful design of the userspace handler: if the userspace handler reads all > > of the task memory that is necessary before applying its security policy, > > the tracee's subsequent memory edits will not be read by the tracer. > > Is this really true? Couldn't a multi-threaded process muck with > memory out from under both the manager and the stopped process? Sure, but as long as the manager copies the relevant arguments out of the tracee's memory *before* evaluating whether it's safe to do the thing the tracee wants to do, it's ok. The assumption here is that the tracee can't corrupt the manager's memory (because if it could, lots of other things would already be broken). > > /* > > * All BPF programs must return a 32-bit value. > > @@ -34,6 +35,7 @@ > > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD > > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ > > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ > > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ > > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ > > /me tries to come up with an ordering rationale here and fails. 
> > An ERRNO filter would block a USER_NOTIF because it's unconditional. > TRACE could be either, USER_NOTIF could be either. > > This means TRACE rules would be bumped by a USER_NOTIF... hmm. Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all seemed more important than USER_NOTIF, but TRACE didn't. I don't have a strong opinion about what to do here, because users can adjust their filters accordingly. Let me know what you prefer. > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > > I wonder if it's time to split up seccomp.c ... probably not, but I've > always been unhappy with the #ifdefs even for just regular _FILTER. ;) A reasonable question. I'm happy to do that as a separate series before this one goes in if you want. > > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, > > + size_t size, loff_t *ppos) > > +{ > > + struct seccomp_filter *filter = file->private_data; > > + struct seccomp_notif_resp resp = {}; > > + struct seccomp_knotif *knotif = NULL; > > + struct list_head *cur; > > + ssize_t ret = -EINVAL; > > + > > + /* No partial writes. */ > > + if (*ppos != 0) > > + return -EINVAL; > > + > > + size = min_t(size_t, size, sizeof(resp)); > > In this case, we can't use min_t, size _must_ be == sizeof(resp), > otherwise we're operating on what's in the stack (which is zeroed, but > still). I'm not sure I follow. If the user passes in an old (smaller) struct seccomp_notif_resp, we don't want to copy more than they specified. If they pass in a bigger one, this will be sizeof(resp). > > + if (copy_from_user(&resp, buf, size)) > > + return -EFAULT; > > + > > + ret = mutex_lock_interruptible(&filter->notify_lock); > > + if (ret < 0) > > + return ret; > > + > > + list_for_each(cur, &filter->notifications) { > > + knotif = list_entry(cur, struct seccomp_knotif, list); > > + > > + if (knotif->id == resp.id) > > + break; > > So we're finding the matching id here. 
Now, I'm trying to think about > how this will look in real-world use: the pid will be _blocked_ while > this happening. And all the other pids that trip this filter will > _also_ be blocked, since they're all waiting for the reader to read > and respond. The risk is pid death while waiting, and having another > appear with the same pid, trigger the same filter, get blocked, and > then the reader replies for the old pid, and the new pid gets the > results? Yep, exactly. > Since this notification queue is already linear, can't we use ordering > to enforce this? i.e. only the pid at the head of the filter > notification queue is going to have anything happening to it. Or is > the idea to have multiple readers/writers of the fd? I'm not really sure how we prevent multiple readers/writers of the fd. But even with a single writer, the case you described "could" happen (although again, with u64 cookies it shouldn't be a problem). I'm not sure how ordering helps us though; the problem is really that one entry for a pid was deleted, and a whole new one was created. So ordering will look ok, but the response will go to the wrong pid. > > + } > > + > > + if (!knotif || knotif->id != resp.id) { > > + ret = -EINVAL; > > + goto out; > > + } > > + > > + ret = size; > > + knotif->state = SECCOMP_NOTIFY_WRITE; > > + knotif->error = resp.error; > > + knotif->val = resp.val; > > + complete(&knotif->ready); > > +out: > > + mutex_unlock(&filter->notify_lock); > > + return ret; > > +} > > + > > +static const struct file_operations seccomp_notify_ops = { > > + .read = seccomp_notify_read, > > + .write = seccomp_notify_write, > > + /* TODO: poll */ > > What's needed for poll? I think you've got all the pieces you need > already, i.e. wait queue, notifications, etc. Nothing, I just didn't implement it. I will do so for v2. 
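The id matching and the pid-reuse race discussed above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the code from the patch: the names (struct listener, notif_add, notif_remove, notif_respond) are hypothetical, the queue is a fixed array rather than a kernel list, and ids are assumed to come from a 64-bit counter that is never reused. Under that assumption a response written for a dead task's notification no longer matches anything, even if a new task with the same pid has queued a fresh notification in the meantime.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NOTIFS 8

/* One pending notification; "used" marks a live slot. (Hypothetical model.) */
struct notif {
	uint64_t id;
	int pid;
	int used;
	int error;
	long val;
};

struct listener {
	uint64_t next_id;	/* last id handed out; ids are never reused */
	struct notif slots[MAX_NOTIFS];
};

/* Queue a notification for pid; returns its id, or 0 if the queue is full. */
uint64_t notif_add(struct listener *l, int pid)
{
	for (size_t i = 0; i < MAX_NOTIFS; i++) {
		if (!l->slots[i].used) {
			l->slots[i] = (struct notif){
				.id = ++l->next_id, .pid = pid, .used = 1,
			};
			return l->slots[i].id;
		}
	}
	return 0;
}

/* Drop a notification, e.g. because the task died while waiting. */
void notif_remove(struct listener *l, uint64_t id)
{
	for (size_t i = 0; i < MAX_NOTIFS; i++)
		if (l->slots[i].used && l->slots[i].id == id)
			l->slots[i].used = 0;
}

/* Deliver a response; returns 0 on success, -1 if no live entry has this id. */
int notif_respond(struct listener *l, uint64_t id, int error, long val)
{
	for (size_t i = 0; i < MAX_NOTIFS; i++) {
		if (l->slots[i].used && l->slots[i].id == id) {
			l->slots[i].error = error;
			l->slots[i].val = val;
			return 0;
		}
	}
	return -1;
}
```

Whether a 64-bit counter is how the final interface generates its cookies is not settled in this thread; the sketch only shows why uniqueness of the id, rather than queue ordering, is what rejects the stale response.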
> > + .release = seccomp_notify_release, > > +}; > > + > > +static struct file *init_listener(struct seccomp_filter *filter) > > +{ > > + struct file *ret; > > + > > + mutex_lock(&filter->notify_lock); > > + if (filter->has_listener) { > > + mutex_unlock(&filter->notify_lock); > > + return ERR_PTR(-EBUSY); > > + } > > + > > + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, > > + filter, O_RDWR); > > + if (IS_ERR(ret)) { > > + __put_seccomp_filter(filter); > > + } else { > > + /* > > + * Intentionally don't put_seccomp_filter(). The file > > + * has a reference to it now. > > + */ > > + filter->has_listener = true; > > + } > > I spent some time staring at this, and I don't see it: where is the > get_() for this? The caller of init_listener() already does a put() on > the failure path. It seems like there is a get() missing near the > start of init_listener(), or I've entirely missed something. Ugh, yes. For the SECCOMP_FILTER_FLAG_GET_LISTENER case, you're right. Originally I only had the ptrace-based one, and that has a get() in get_nth_filter(), so the comment makes sense in that case. I'll straighten this out for v2 and > (Regardless, I think the usage counting need a comment somewhere, > maybe near the top of seccomp.c with the field?) ...add a comment. > > +#define USER_NOTIF_MAGIC 116983961184613L > > Is this just you mashing the numpad? :) Is there some better way to generate magic numbers? 
:) Cheers, Tycho From tycho at tycho.ws Wed Feb 14 15:33:06 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Wed, 14 Feb 2018 08:33:06 -0700 Subject: [RFC 2/3] seccomp: hoist out filter resolving logic In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-3-tycho@tycho.ws> Message-ID: <20180214153306.m3wlmz6zwjqsav36@smitten> On Tue, Feb 13, 2018 at 01:29:23PM -0800, Kees Cook wrote: > On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > > Hoist out the nth filter resolving logic that ptrace uses into a new > > function. We'll use this in the next patch to implement the new > > PTRACE_SECCOMP_GET_FILTER_FLAGS command. This is based on an older patch > > that I had sent a while ago; it significantly revamps the get_nth_filter > > logic based on previous suggestions from Oleg. > > Is this the same as f06eae831f0c1fc5b982ea200daf552810e1dd55 ? Quick > compare says yes? Either way, please rebase to v4.16-rc1 (or -rc2 in > the future). :) Yep, there was no tagged tree with that when I did these; I'll do that for the next version. Cheers, Tycho From tycho at tycho.ws Wed Feb 14 15:33:59 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Wed, 14 Feb 2018 08:33:59 -0700 Subject: [RFC 3/3] seccomp: add a way to get a listener fd from ptrace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-4-tycho@tycho.ws> Message-ID: <20180214153359.6wj6wclsqvgj4jlt@smitten> On Tue, Feb 13, 2018 at 01:32:26PM -0800, Kees Cook wrote: > On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() > > version which can acquire filters is useful. There are at least two reasons > > this is preferable, even though it uses ptrace: > > > > 1. You can control tasks that aren't cooperating with you > > 2. 
You can control tasks whose filters block sendmsg() and socket(); if the > > task installs a filter which blocks these calls, there's no way with > > SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. > > I got worried for a second that this would get us into a many-to-many > state, but I see init_listener enforces a single listener per filter. > Whew. Seems legit. :) Yes, although if you sendmsg() the listener fd, you could still get into that state, so it's still maybe a concern? Tycho From luto at amacapital.net Wed Feb 14 17:19:52 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Wed, 14 Feb 2018 17:19:52 +0000 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: <20180214152958.cjgwh2k52zji2jxk@cisco> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180214152958.cjgwh2k52zji2jxk@cisco> Message-ID: On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen wrote: > Hey Kees, > > Thanks for taking a look! > > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: >> > This patch introduces a means for syscalls matched in seccomp to notify >> > some other task that a particular filter has been triggered. >> > >> > The motivation for this is primarily for use with containers. For example, >> > if a container does an init_module(), we obviously don't want to load this >> > untrusted code, which may be compiled for the wrong version of the kernel >> > anyway. Instead, we could parse the module image, figure out which module >> > the container is trying to load and load it on the host. >> > >> > As another example, containers cannot mknod(), since this checks >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard >> > coding some whitelist in the kernel. 
Another example is mount(), which has >> > many security restrictions for good reason, but configuration or runtime >> > knowledge could potentially be used to relax these restrictions. >> >> Related to the eBPF seccomp thread, can the logic for these things be >> handled entirely by eBPF? My assumption is that you still need to stop >> the process to do something (i.e. do a mknod, or a mount) before >> letting it continue. Is there some "wait for notification" system in >> eBPF? > > I replied in the other thread > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those > following along at home), but no, at least not that I know of. eBPF can call functions. One of those functions could put the caller to sleep. In fact, I think I once proposed doing this for the seccomp logging action as well. >> I wonder if this communication should be netlink, which gives a more >> well-structured way to describe what's on the wire? The reason I ask >> is because if we ever change the seccomp_data structure, we'll now >> have two places where we need to deal with it (the first being within >> the BPF itself). My initial idea was to prefix the communication with >> a size field, then send the structure, and then I had nightmares, and >> realized this was basically netlink reinvented. > > I suggested netlink in LA, and everyone (especially Andy) groaned very > loudly :). I'm happy to switch it to netlink if you like, although i > think memcpy() of structs should be safe here, since the return value > from read or write can indicate the size of things. I could easily get on board with "netlink" (i.e. NLA) messages sent over an fd. I will object strongly to the use of netlink *sockets*. > >> An ERRNO filter would block a USER_NOTIF because it's unconditional. >> TRACE could be either, USER_NOTIF could be either. >> >> This means TRACE rules would be bumped by a USER_NOTIF... hmm. > > Yes, I didn't exactly know what to do here. 
ERRNO, TRAP, and KILL all
> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> a strong opinion about what to do here, because users can adjust their
> filters accordingly. Let me know what you prefer.

If we switched to eBPF functions, this whole issue goes away.

From lkml at metux.net Wed Feb 14 17:21:12 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 18:21:12 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <60748622.exvCVAzLTp@blindfold>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net>
 <2050418.Dl5pXkWGsk@blindfold>
 <4f620eb7-c00c-487b-2e06-8cc4c97af38c@metux.net>
 <60748622.exvCVAzLTp@blindfold>
Message-ID:

On 14.02.2018 16:17, Richard Weinberger wrote:

> From taking a *very* quick look into busybox source, I suspect this should fix
> it:
>
> diff --git a/util-linux/unshare.c b/util-linux/unshare.c
> index 875e3f86e304..3f59cf4d27c2 100644
> --- a/util-linux/unshare.c
> +++ b/util-linux/unshare.c
> @@ -350,9 +350,9 @@ int unshare_main(int argc UNUSED_PARAM, char **argv)
>                  * in that user namespace.
>                  */
>                 xopen_xwrite_close(PATH_PROC_SETGROUPS, "deny");
> -               sprintf(uidmap_buf, "%u 0 1", (unsigned)reuid);
> +               sprintf(uidmap_buf, "0 %u 1", (unsigned)reuid);
>                 xopen_xwrite_close(PATH_PROC_UIDMAP, uidmap_buf);
> -               sprintf(uidmap_buf, "%u 0 1", (unsigned)regid);
> +               sprintf(uidmap_buf, "0 %u 1", (unsigned)regid);
>                 xopen_xwrite_close(PATH_PROC_GIDMAP, uidmap_buf);
>         } else
>         if (setgrp_str) {
>

hmm, now it works, but only when strace'ing it.
that's really strange.

But still I wonder whether user_ns really solves my problem, as I don't
want to create sandboxed users, but only private namespaces just like
on Plan9.
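
For reference, the field order that the quoted patch swaps can be sketched as a small helper. Each line of /proc/self/uid_map (and gid_map) has the form "<id inside the new namespace> <id in the parent namespace> <length>", as described in user_namespaces(7). An unprivileged writer may only map its own effective id in the parent namespace, so with uid 1 the string "1 0 1" (inside 1, parent 0) is refused with EPERM, while "0 1 1" (inside 0, parent 1) is accepted. The helper name format_id_map is hypothetical and is not part of busybox; this is a minimal sketch of the string formatting only, it does not touch /proc.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format one uid_map/gid_map line as "<inside> <outside> <count>".
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small. (Hypothetical helper, not busybox code.)
 */
int format_id_map(char *buf, size_t len,
		  unsigned inside, unsigned outside, unsigned count)
{
	int n = snprintf(buf, len, "%u %u %u", inside, outside, count);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With reuid == 1, format_id_map(buf, sizeof(buf), 0, reuid, 1) yields the line the fixed unshare applet writes, i.e. "become root inside, stay uid 1 outside".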
--mtx -- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info at metux.net -- +49-151-27565287 From tycho at tycho.ws Wed Feb 14 17:23:00 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Wed, 14 Feb 2018 10:23:00 -0700 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180214152958.cjgwh2k52zji2jxk@cisco> Message-ID: <20180214172300.7v2pre4rv4zzrj3s@cisco> On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote: > On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen wrote: > > Hey Kees, > > > > Thanks for taking a look! > > > > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: > >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > >> > This patch introduces a means for syscalls matched in seccomp to notify > >> > some other task that a particular filter has been triggered. > >> > > >> > The motivation for this is primarily for use with containers. For example, > >> > if a container does an init_module(), we obviously don't want to load this > >> > untrusted code, which may be compiled for the wrong version of the kernel > >> > anyway. Instead, we could parse the module image, figure out which module > >> > the container is trying to load and load it on the host. > >> > > >> > As another example, containers cannot mknod(), since this checks > >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > >> > coding some whitelist in the kernel. Another example is mount(), which has > >> > many security restrictions for good reason, but configuration or runtime > >> > knowledge could potentially be used to relax these restrictions. > >> > >> Related to the eBPF seccomp thread, can the logic for these things be > >> handled entirely by eBPF? 
My assumption is that you still need to stop > >> the process to do something (i.e. do a mknod, or a mount) before > >> letting it continue. Is there some "wait for notification" system in > >> eBPF? > > > > I replied in the other thread > > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those > > following along at home), but no, at least not that I know of. > > eBPF can call functions. One of those functions could put the caller > to sleep. In fact, I think I once proposed doing this for the seccomp > logging action as well. Yes, true. We could always add a bpf_func_map_lookup_wait or something. I can look into that if it's preferable. From luto at amacapital.net Wed Feb 14 17:25:00 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Wed, 14 Feb 2018 17:25:00 +0000 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook wrote: > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >> This patchset enables seccomp filters to be written in eBPF. Although, >> this patchset doesn't introduce much of the functionality enabled by >> eBPF, it lays the ground work for it. >> >> It also introduces the capability to dump eBPF filters via the PTRACE >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. >> In the attached samples, there's an example of this. One can then use >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, >> and use that at reload time. >> >> The primary reason for not adding maps support in this patchset is >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >> If we have a map that the BPF program can read, it can potentially >> "change" privileges after running. It seems like doing writes only >> is safe, because it can be pure, and side effect free, and therefore >> not negatively effect PR_SET_NO_NEW_PRIVS. 
Nonetheless, if we come >> to an agreement, this can be in a follow-up patchset. > > What's the reason for adding eBPF support? seccomp shouldn't need it, > and it only makes the code more complex. I'd rather stick with cBPF > until we have an overwhelmingly good reason to use eBPF as a "native" > seccomp filter language. > I can think of two fairly strong use cases for eBPF's ability to call functions: logging and Tycho's user notifier thing. They let seccomp filters *do* something synchronously, which is a better match for both use cases than the current approach of "hey, I'd like to log this syscall, but it's really awkward to attach other information or to track exactly *which* filter logged what or to stack any of it". Also, eBPF's stronger arithmetic support would allow bitops (I think), which would make "is the nr in this list" quite a bit faster in some cases. From tycho at tycho.ws Wed Feb 14 17:32:22 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Wed, 14 Feb 2018 10:32:22 -0700 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: <20180214173222.kvos6izqcywkuyi5@cisco> On Wed, Feb 14, 2018 at 05:25:00PM +0000, Andy Lutomirski wrote: > On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook wrote: > > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: > >> This patchset enables seccomp filters to be written in eBPF. Although, > >> this patchset doesn't introduce much of the functionality enabled by > >> eBPF, it lays the ground work for it. > >> > >> It also introduces the capability to dump eBPF filters via the PTRACE > >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed. > >> In the attached samples, there's an example of this. One can then use > >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program, > >> and use that at reload time. 
> >>
> >> The primary reason for not adding maps support in this patchset is
> >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >> If we have a map that the BPF program can read, it can potentially
> >> "change" privileges after running. It seems like doing writes only
> >> is safe, because it can be pure, and side effect free, and therefore
> >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >> to an agreement, this can be in a follow-up patchset.
> >
> > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > and it only makes the code more complex. I'd rather stick with cBPF
> > until we have an overwhelmingly good reason to use eBPF as a "native"
> > seccomp filter language.
> >
>
> I can think of two fairly strong use cases for eBPF's ability to call
> functions: logging and Tycho's user notifier thing.

Worth noting that there is one additional thing that I didn't
implement, but which would be nice and is probably not possible with
eBPF (at least, not without a bunch of additional infrastructure):
passing fds back to the tracee from the manager if you intercept
socket(), or accept() or something.

This could again be accomplished via other means, though it would be a
lot nicer to have a primitive for it.

That said, I think it's more important that something like this gets
in, vs. that it gets in with some approach like I've posted. So if we
go with eBPF and some wait functions and acknowledge that you have to
do some ptrace surgery, that is better than nothing.

Tycho

From richard at sigma-star.at Wed Feb 14 17:50:29 2018
From: richard at sigma-star.at (Richard Weinberger)
Date: Wed, 14 Feb 2018 18:50:29 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To:
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net>
 <60748622.exvCVAzLTp@blindfold>
Message-ID: <4042675.OEy7g9C5ya@blindfold>

On Wednesday, 14 February 2018, 18:21:12 CET, Enrico Weigelt wrote:
> On 14.02.2018 16:17, Richard Weinberger wrote:
> > From taking a *very* quick look into busybox source, I suspect this
> > should fix
> >
> > it:
> >
> > diff --git a/util-linux/unshare.c b/util-linux/unshare.c
> > index 875e3f86e304..3f59cf4d27c2 100644
> > --- a/util-linux/unshare.c
> > +++ b/util-linux/unshare.c
> > @@ -350,9 +350,9 @@ int unshare_main(int argc UNUSED_PARAM, char **argv)
> >
> >                  * in that user namespace.
> >                  */
> >
> >                 xopen_xwrite_close(PATH_PROC_SETGROUPS, "deny");
> >
> > -               sprintf(uidmap_buf, "%u 0 1", (unsigned)reuid);
> > +               sprintf(uidmap_buf, "0 %u 1", (unsigned)reuid);
> >
> >                 xopen_xwrite_close(PATH_PROC_UIDMAP, uidmap_buf);
> >
> > -               sprintf(uidmap_buf, "%u 0 1", (unsigned)regid);
> > +               sprintf(uidmap_buf, "0 %u 1", (unsigned)regid);
> >
> >                 xopen_xwrite_close(PATH_PROC_GIDMAP, uidmap_buf);
> >
> >         } else
> >         if (setgrp_str) {
>
> hmm, now it works, but only when strace'ing it.
> that's really strange.

On my box, with my patch applied, also busybox works now.

> But still I wonder whether user_ns really solves my problem, as I don't
> want to create sandboxed users, but only private namespaces just like
> on Plan9.

Well, I'd be surprised if that works out of the box.
Since you're posting on LKML I assumed you're hacking the kernel to support
plan9-alike namespaces...

Thanks,
//richard

--
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y

From lkml at metux.net Wed Feb 14 17:58:00 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 18:58:00 +0100
Subject: [PATCH] p9caps: add Plan9 capability devices
In-Reply-To: <20180214145650.GA2102@mail.hallyn.com>
References: <40d4c871-a16a-7b8f-2d4a-422a5a490693@infradead.org>
 <20180211215028.16210-1-metux@gmx.de>
 <20180211215028.16210-2-metux@gmx.de>
 <20180213071655.GA11240@mail.hallyn.com>
 <3a99edaf-0365-ec7e-4d2f-1e21dea007ac@gmx.de>
 <20180214145650.GA2102@mail.hallyn.com>
Message-ID: <23e3c1de-13e0-34ab-e2a8-b40c59e5b986@metux.net>

On 14.02.2018 15:56, Serge E. Hallyn wrote:

> If it's an out of tree module you'd have to do it this way, but if
> it's in-tree, even as a module, adding a bit to the userns struct
> would imo be ok.

Assuming one doesn't try to load the module when the kernel image
previously was built w/o it ;-)
(well, could export some dummy symbol for protection ;-)).

OTOH, that raises the question, where / how exactly the cap list
destruction / expiry should be done. My original plan was adding a
timer in the p9caps module that just scans for old entries.

Should the userns code just call back on userns destruction ?
(in that case it would be tricky to have it as a module)

>> the whole thing might become a bit more complex when introducing
>> plan9-like unprivileged mount operations. haven't sorted out how to
>> do that yet.
>
> I hope you'll have a discussion here about that first.

Yes, of course - that's why I'm here :p

My current idea is introducing some special flag for disabling suid
completely and switch into an private namespace, where now the
unprivileged user can mount at will and create new mnt namespaces,
just like on Plan9.

I'll try some qnd hacks w/ a new syscall, lets see where it leads to,
and then sort out how to do that in a more appropriate way.

> Now speaking practically, I love the caphash idea, but it does
> have issues with a modern login system. There are privileged
> things which login needs to do besides changing uid, including but
> not limited to:
> 1. setting limits
> 2. setting loginuid,
> 3. mounting things (polyinstantiated /tmp, decrypted homedir, etc)
> 4. setting selinux context

For now, I don't think that's necessary for doing things the Plan9
way. Perhaps we later could extend the /dev/caphash interface w/
additional parameters for that.

> (and of course gplv3 as Al pointed out is a blocker)

already fixed.

--mtx

--
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info at metux.net -- +49-151-27565287

From lkml at metux.net Wed Feb 14 18:01:52 2018
From: lkml at metux.net (Enrico Weigelt)
Date: Wed, 14 Feb 2018 19:01:52 +0100
Subject: plan9 semantics on Linux - mount namespaces
In-Reply-To: <4042675.OEy7g9C5ya@blindfold>
References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net>
 <60748622.exvCVAzLTp@blindfold>
 <4042675.OEy7g9C5ya@blindfold>
Message-ID: <794929ce-0ecb-4c93-d51e-e94fcf749cfa@metux.net>

On 14.02.2018 18:50, Richard Weinberger wrote:

>> hmm, now it works, but only when strace'ing it.
>> that's really strange.
>
> On my box, with my patch applied, also busybox works now.

hmm, w/o strace, too ?
Which version are you using ? I've got 1.27.2

>> But still I wonder whether user_ns really solves my problem, as I don't
>> want to create sandboxed users, but only private namespaces just like
>> on Plan9.
>
> Well, I'd be surprised if that works out of the box.
> Since you're posting on LKML I assumed you're hacking the kernel to support
> plan9-alike namespaces...
Yes, that's the plan :) --mtx -- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info at metux.net -- +49-151-27565287 From richard at sigma-star.at Wed Feb 14 18:12:33 2018 From: richard at sigma-star.at (Richard Weinberger) Date: Wed, 14 Feb 2018 19:12:33 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <794929ce-0ecb-4c93-d51e-e94fcf749cfa@metux.net> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <4042675.OEy7g9C5ya@blindfold> <794929ce-0ecb-4c93-d51e-e94fcf749cfa@metux.net> Message-ID: <6753239.SmZu9LK57z@blindfold> Am Mittwoch, 14. Februar 2018, 19:01:52 CET schrieb Enrico Weigelt: > On 14.02.2018 18:50, Richard Weinberger wrote: > >> hmm, now it works, but only when strace'ing it. > >> that's really strange. > > > > On my box, with my patch applied, also busybox works now. > > hmm, w/o strace, too ? Sure. > Which version are you using ? I've got 1.27.2 Both master and 1.12.x BTW: Your issue is fixed/known. Just checked. commit 1b510900e24459353922a1bc83c0b58bc8bafe1c Author: Denys Vlasenko Date: Thu Nov 9 16:06:33 2017 +0100 unshare: -r should map root to user, not the other way around Signed-off-by: Denys Vlasenko Thanks, //richard -- sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria ATU66964118 - FN 374287y From lkml at metux.net Wed Feb 14 18:32:53 2018 From: lkml at metux.net (Enrico Weigelt) Date: Wed, 14 Feb 2018 19:32:53 +0100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <6753239.SmZu9LK57z@blindfold> References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <4042675.OEy7g9C5ya@blindfold> <794929ce-0ecb-4c93-d51e-e94fcf749cfa@metux.net> <6753239.SmZu9LK57z@blindfold> Message-ID: <54f88b0d-282d-db79-9db6-ff82a16aaa62@metux.net> On 14.02.2018 19:12, Richard Weinberger wrote: > BTW: Your issue is fixed/known. Just checked. aha, on 1.2.28 ... I'll have to upgrade. 
--mtx -- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info at metux.net -- +49-151-27565287 From asarai at suse.de Wed Feb 14 20:39:33 2018 From: asarai at suse.de (Aleksa Sarai) Date: Thu, 15 Feb 2018 07:39:33 +1100 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <2050418.Dl5pXkWGsk@blindfold> <4f620eb7-c00c-487b-2e06-8cc4c97af38c@metux.net> <60748622.exvCVAzLTp@blindfold> Message-ID: <20180214203933.55yraxue7hpup65x@gordon> On 2018-02-14, Enrico Weigelt wrote: > But still I wonder whether user_ns really solves my problem, as I don't > want to create sandboxed users, but only private namespaces just like > on Plan9. On Linux you need to have CAP_SYS_ADMIN (in the user_ns that owns your current mnt_ns) in order to mount anything, and to create any namespaces (in your current user_ns). So, in order to use the functionality of mnt_ns (the ability to create mounts only a subset of processes can see) as an unprivileged user, you need to use user_ns. (Note there is an additional restriction, namely that a mnt_ns that was set up in the non-root user_ns cannot mount any filesystems that do not have the FS_USERNS_MOUNT option set. This is also for security, as exposing the kernel filesystem parser to arbitrary data by unprivileged users wasn't deemed to be a safe thing to do. The unprivileged FUSE work that Richard linked to will likely be useful for pushing FS_USERNS_MOUNT into more filesystems -- like 9p.) -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From alexei.starovoitov at gmail.com Thu Feb 15 04:30:29 2018 From: alexei.starovoitov at gmail.com (Alexei Starovoitov) Date: Wed, 14 Feb 2018 20:30:29 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <20180214173222.kvos6izqcywkuyi5@cisco> References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> <20180214173222.kvos6izqcywkuyi5@cisco> Message-ID: <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote: > > > > > > What's the reason for adding eBPF support? seccomp shouldn't need it, > > > and it only makes the code more complex. I'd rather stick with cBPF > > > until we have an overwhelmingly good reason to use eBPF as a "native" > > > seccomp filter language. > > > > > > > I can think of two fairly strong use cases for eBPF's ability to call > > functions: logging and Tycho's user notifier thing. > > Worth noting that there is one additional thing that I didn't > implement, but which would be nice and is probably not possible with > eBPF (at least, not without a bunch of additional infrastructure): > passing fds back to the tracee from the manager if you intercept > socket(), or accept() or something. > > This could again be accomplished via other means, though it would be a > lot nicer to have a primitive for it. there is bpf_perf_event_output() interface that allows to stream arbitrary data from kernel into user space via perf ring buffer. User space can epoll on it. We use this in both tracing and networking for notifications and streaming data transfers. I suspect this can be used for 'logging' too, since it's cheap and fast. Specifically for android we added bpf_lsm hooks, cookie/uid helpers, and read-only maps. Lorenzo, there was a claim in this thread that bpf is disabled on android. Can you please clarify ? 
If it's actually disabled and there is no intent to enable it, I'd rather not add any more android specific features to bpf. What I think is important to understand is that BPF goes through very active development. The verifier is constantly getting smarter. There is work to add bounded loops, lock/unlock, get/put tracking, global/percpu variables, dynamic linking and so on. Most of the features are available to root only and unpriv has very limited set. Like getting bpf_perf_event_output() to work for unpriv will likely require additional verifier work. So all cool bits will not be usable by seccomp+eBPF and unpriv on day one. It's not a lot of work either, but once it's done I'd hate to see arguments against adding more verifier features just because eBPF is used by seccomp/landlock/other_security_thing. Also I think the argument that seccomp+eBPF will be faster than seccomp+cBPF is a weak one. I bet kpti on/off makes no difference under seccomp, since _all_ syscalls are already slow for sandboxed app. Instead of making seccomp 5% faster with eBPF, I think it's worth looking into extending LSM hooks to cover all syscalls and have programmable (bpf or whatever) filtering applied per syscall. Like we can have a white list syscall table covered by lsm hooks and any other syscall will get into old seccomp-style filtering category automatically. lsm+bpf would need to follow process hierarchy. It shouldn't be a runtime check at syscall entry either, but compile time extra branch in SYSCALL_DEFINE for non-whitelisted syscalls. There are bunch of other things to figure out, but I think the perf win will be bigger than replacing cBPF with eBPF in existing seccomp. 
From lorenzo at google.com Thu Feb 15 08:35:07 2018 From: lorenzo at google.com (Lorenzo Colitti) Date: Thu, 15 Feb 2018 17:35:07 +0900 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> <20180214173222.kvos6izqcywkuyi5@cisco> <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> Message-ID: On Thu, Feb 15, 2018 at 1:30 PM, Alexei Starovoitov wrote: > Specifically for android we added bpf_lsm hooks, cookie/uid helpers, > and read-only maps. > Lorenzo, > there was a claim in this thread that bpf is disabled on android. > Can you please clarify ? It's not compiled out, at least at the moment. https://android.googlesource.com/kernel/configs/+/master/android-4.9/android-base.cfg has CONFIG_BPF_SYSCALL=y. As with many things on Android, use of EBPF is (heavily) restricted via selinux, and I'm not aware of any plans to allow unprivileged applications to use EBPF, or even or any usecases other than network accounting. Even for this use case, we're looking at having the program being completely read-only and baked into the system image. I definitely don't have a complete view of things though. Also, bear in mind that none of this code has been released yet, so things could change. From mszeredi at redhat.com Thu Feb 15 08:46:51 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Thu, 15 Feb 2018 09:46:51 +0100 Subject: [PATCH 10/11] fuse: Allow user namespace mounts In-Reply-To: References: Message-ID: On Wed, Feb 14, 2018 at 2:44 PM, Miklos Szeredi wrote: > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >> From: Seth Forshee >> >> To be able to mount fuse from non-init user namespaces, it's necessary >> to set FS_USERNS_MOUNT flag to fs_flags. 
>> >> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/ >> >> Cc: linux-fsdevel at vger.kernel.org >> Cc: linux-kernel at vger.kernel.org >> Cc: Miklos Szeredi >> Signed-off-by: Seth Forshee >> [dongsu: add a simple commit messasge] >> Signed-off-by: Dongsu Park >> --- >> fs/fuse/inode.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c >> index 7f6b2e55..8c98edee 100644 >> --- a/fs/fuse/inode.c >> +++ b/fs/fuse/inode.c >> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb) >> static struct file_system_type fuse_fs_type = { >> .owner = THIS_MODULE, >> .name = "fuse", >> - .fs_flags = FS_HAS_SUBTYPE, >> + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT, >> .mount = fuse_mount, >> .kill_sb = fuse_kill_sb_anon, >> }; > > I think enabling FS_USERNS_MOUNT should be pretty safe. > > I was thinking opting out should be as simple as "chmod o-rw > /dev/fuse". But that breaks libfuse, even though fusermount opens > /dev/fuse in privileged mode, so it shouldn't. I'm talking rubbish, /dev/fuse is opened without privs in fusermount as well. So there's not way to differentiate user_ns unpriv mounts from suid fusermount unpriv mounts. Maybe that's just as well... Thanks, Miklos
From christian.brauner at canonical.com Thu Feb 15 14:48:56 2018 From: christian.brauner at canonical.com (Christian Brauner) Date: Thu, 15 Feb 2018 15:48:56 +0100 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180214152958.cjgwh2k52zji2jxk@cisco> Message-ID: <20180215144855.GA16088@gmail.com> On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote: > On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen wrote: > > Hey Kees, > > > > Thanks for taking a look! > > > > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: > >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > >> > This patch introduces a means for syscalls matched in seccomp to notify > >> > some other task that a particular filter has been triggered. > >> > > >> > The motivation for this is primarily for use with containers. For example, > >> > if a container does an init_module(), we obviously don't want to load this > >> > untrusted code, which may be compiled for the wrong version of the kernel > >> > anyway. Instead, we could parse the module image, figure out which module > >> > the container is trying to load and load it on the host. > >> > > >> > As another example, containers cannot mknod(), since this checks > >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > >> > coding some whitelist in the kernel. Another example is mount(), which has > >> > many security restrictions for good reason, but configuration or runtime > >> > knowledge could potentially be used to relax these restrictions. > >> > >> Related to the eBPF seccomp thread, can the logic for these things be > >> handled entirely by eBPF? My assumption is that you still need to stop > >> the process to do something (i.e. do a mknod, or a mount) before > >> letting it continue.
Is there some "wait for notification" system in > >> eBPF? > > > > I replied in the other thread > > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those > > following along at home), but no, at least not that I know of. > > eBPF can call functions. One of those functions could put the caller > to sleep. In fact, I think I once proposed doing this for the seccomp > logging action as well. > > >> I wonder if this communication should be netlink, which gives a more > >> well-structured way to describe what's on the wire? The reason I ask > >> is because if we ever change the seccomp_data structure, we'll now > >> have two places where we need to deal with it (the first being within > >> the BPF itself). My initial idea was to prefix the communication with > >> a size field, then send the structure, and then I had nightmares, and > >> realized this was basically netlink reinvented. > > > > I suggested netlink in LA, and everyone (especially Andy) groaned very > > loudly :). I'm happy to switch it to netlink if you like, although i > > think memcpy() of structs should be safe here, since the return value > > from read or write can indicate the size of things. > > I could easily get on board with "netlink" (i.e. NLA) messages sent > over an fd. I will object strongly to the use of netlink *sockets*. I think sending netlink messages makes perfect sense here although we burden userspace with all those nice macros to parse these messages. Are there already other cases where userspace gets netlink messages on fds without having opened a netlink socket. > > > > >> An ERRNO filter would block a USER_NOTIF because it's unconditional. > >> TRACE could be either, USER_NOTIF could be either. > >> > >> This means TRACE rules would be bumped by a USER_NOTIF... hmm. > > > > Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all > > seemed more important than USER_NOTIF, but TRACE didn't. 
I don't have > > a strong opinion about what to do here, because users can adjust their > > filters accordingly. Let me know what you prefer. > > If we switched to eBPF functions, this whole issue goes away. From luto at amacapital.net Thu Feb 15 16:05:18 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Thu, 15 Feb 2018 08:05:18 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> <20180214173222.kvos6izqcywkuyi5@cisco> <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> Message-ID: <17F5A58C-AEE3-4E99-A0F9-313533109FD5@amacapital.net> > On Feb 14, 2018, at 8:30 PM, Alexei Starovoitov wrote: > > On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote: >>>> >>>> What's the reason for adding eBPF support? seccomp shouldn't need it, >>>> and it only makes the code more complex. I'd rather stick with cBPF >>>> until we have an overwhelmingly good reason to use eBPF as a "native" >>>> seccomp filter language. >>>> >>> >>> I can think of two fairly strong use cases for eBPF's ability to call >>> functions: logging and Tycho's user notifier thing. >> >> Worth noting that there is one additional thing that I didn't >> implement, but which would be nice and is probably not possible with >> eBPF (at least, not without a bunch of additional infrastructure): >> passing fds back to the tracee from the manager if you intercept >> socket(), or accept() or something. >> >> This could again be accomplished via other means, though it would be a >> lot nicer to have a primitive for it. > > there is bpf_perf_event_output() interface that allows to stream > arbitrary data from kernel into user space via perf ring buffer. > User space can epoll on it. We use this in both tracing and networking > for notifications and streaming data transfers. 
> I suspect this can be used for 'logging' too, since it's cheap and fast. I think this is the right idea but we'd want to tweak it. We don't want the log messages to go to some systemwide buffer (seccomp can already do this and it's annoying) -- we want them to go to the filter's creator. In fact, the seccomp listener fd concept could easily be extended to do exactly this. > > Also I think the argument that seccomp+eBPF will be faster than > seccomp+cBPF is a weak one. I bet kpti on/off makes no difference > under seccomp, since _all_ syscalls are already slow for sandboxed app. It's been a while since I benchmarked it, but I suspect that a simple seccomp filter is quite a bit faster than a PTI transition. From ebiederm at xmission.com Fri Feb 16 18:26:59 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Fri, 16 Feb 2018 12:26:59 -0600 Subject: plan9 semantics on Linux - mount namespaces In-Reply-To: <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> (Enrico Weigelt's message of "Tue, 13 Feb 2018 22:19:48 +0000") References: <0f058286-a432-379b-f559-f2fe713807ab@metux.net> <5633d335-3926-d98f-d6d7-948b1e2a0b2c@metux.net> Message-ID: <87po54x024.fsf@xmission.com> Enrico Weigelt writes: > On 13.02.2018 22:12, Enrico Weigelt wrote: > > CC @containers at lists.linux-foundation.org > >> Hi folks, >> >> >> I'm currently trying to implement plan9 semantics on Linux and >> yet sorting out how to do the mount namespace handling. >> >> On plan9, any unprivileged process can create its own namespace >> and mount/bind at will, while on Linux this requires CAP_SYS_ADMIN. >> >> What is the reason for not allowing arbitrary users to create their >> own private mount namespace ? What could go wrong here ? Suid root executables could be fooled. An easy case is fooling /bin/su into reading a different copy of /etc/shadow, and allowing arbitrary changes between users.
>> IMHO, we could allow mount/bind under the following conditions: >> >> * the process is in a private mount namespace >> * no suid-flag is honored (either force all mounts to nosuid or >> ? completely mask it out) >> * only certain whitelisted filesystems allowed (eg. 9P and FUSE) >> >> Maybe that all could be enabled by a new capability. >> >> >> any suggestions ? User namespaces limit the contained processes to not having any permissions outside of the user namespace. While still allowing the fully unix permission model inside user namespaces. I am in the final stages of getting the changes in the vfs and in fuse to allow unprivileged users to mount that filesystem. plan9fs would also be a candidate for that kind of treatment if it had a maintainer. Eric From sargun at sargun.me Fri Feb 16 18:39:24 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Fri, 16 Feb 2018 10:39:24 -0800 Subject: [PATCH net-next 0/3] eBPF Seccomp filters In-Reply-To: <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> References: <20180213154244.GA3292@ircssh-2.c.rugged-nimbus-611.internal> <20180214173222.kvos6izqcywkuyi5@cisco> <20180215043027.zssmhvfdn7iz3rlz@ast-mbp.dhcp.thefacebook.com> Message-ID: On Wed, Feb 14, 2018 at 8:30 PM, Alexei Starovoitov wrote: > On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote: >> > > >> > > What's the reason for adding eBPF support? seccomp shouldn't need it, >> > > and it only makes the code more complex. I'd rather stick with cBPF >> > > until we have an overwhelmingly good reason to use eBPF as a "native" >> > > seccomp filter language. >> > > >> > >> > I can think of two fairly strong use cases for eBPF's ability to call >> > functions: logging and Tycho's user notifier thing. 
>> >> Worth noting that there is one additional thing that I didn't >> implement, but which would be nice and is probably not possible with >> eBPF (at least, not without a bunch of additional infrastructure): >> passing fds back to the tracee from the manager if you intercept >> socket(), or accept() or something. >> >> This could again be accomplished via other means, though it would be a >> lot nicer to have a primitive for it. > > there is bpf_perf_event_output() interface that allows to stream > arbitrary data from kernel into user space via perf ring buffer. > User space can epoll on it. We use this in both tracing and networking > for notifications and streaming data transfers. > I suspect this can be used for 'logging' too, since it's cheap and fast. > > Specifically for android we added bpf_lsm hooks, cookie/uid helpers, > and read-only maps. > Lorenzo, > there was a claim in this thread that bpf is disabled on android. > Can you please clarify ? > If it's actually disabled and there is no intent to enable it, > I'd rather not add any more android specific features to bpf. > > What I think is important to understand is that BPF goes through > very active development. The verifier is constantly getting smarter. > There is work to add bounded loops, lock/unlock, get/put tracking, > global/percpu variables, dynamic linking and so on. > Most of the features are available to root only and unpriv > has very limited set. Like getting bpf_perf_event_output() to work > for unpriv will likely require additional verifier work. > > So all cool bits will not be usable by seccomp+eBPF and unpriv > on day one. It's not a lot of work either, but once it's done > I'd hate to see arguments against adding more verifier features > just because eBPF is used by seccomp/landlock/other_security_thing. > > Also I think the argument that seccomp+eBPF will be faster than > seccomp+cBPF is a weak one. 
I bet kpti on/off makes no difference > under seccomp, since _all_ syscalls are already slow for sandboxed app. > Instead of making seccomp 5% faster with eBPF, I think it's > worth looking into extending LSM hooks to cover all syscalls and > have programmable (bpf or whatever) filtering applied per syscall. > Like we can have a white list syscall table covered by lsm hooks > and any other syscall will get into old seccomp-style > filtering category automatically. > lsm+bpf would need to follow process hierarchy. It shouldn't be > a runtime check at syscall entry either, but compile time > extra branch in SYSCALL_DEFINE for non-whitelisted syscalls. > There are bunch of other things to figure out, but I think > the perf win will be bigger than replacing cBPF with eBPF in > existing seccomp. >
Given this test program:

    for (i = 10; i < 99999999; i++) syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs and tail calls, the numbers are as follows:

    ebpf JIT          12.3% slower than native
    ebpf no JIT       13.6% slower than native
    seccomp JIT       17.6% slower than native
    seccomp no JIT    37% slower than native

This is using libseccomp for the standard seccomp BPF program. There's no reasonable way for our workload to know which syscalls come "earlier", so we can't take that optimization. Potentially, libseccomp can be smarter about ordering cases (using ranges), and use an O(log(n)) search algorithm, but both of these are micro-optimizations that scale with the number of syscalls and per-syscall rules. The nicety of using a PROG_ARRAY means that adding additional filters (syscalls) comes at no cost, whereas there's a tradeoff any time you add another rule in traditional seccomp filters. This was tested on an Amazon M4.16XL running with pcid, and KPTI.
From ebiederm at xmission.com Fri Feb 16 21:52:32 2018 From: ebiederm at xmission.com (Eric W.
Biederman) Date: Fri, 16 Feb 2018 15:52:32 -0600 Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: (Miklos Szeredi's message of "Tue, 13 Feb 2018 11:20:07 +0100") References: <87lgfy5fpd.fsf@xmission.com> Message-ID: <87606wtxen.fsf@xmission.com> Miklos Szeredi writes: > On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman > wrote: >> Miklos Szeredi writes: >> >>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >>>> From: Seth Forshee >>>> >>>> In order to support mounts from namespaces other than >>>> init_user_ns, fuse must translate uids and gids to/from the >>>> userns of the process servicing requests on /dev/fuse. This >>>> patch does that, with a couple of restrictions on the namespace: >>>> >>>> - The userns for the fuse connection is fixed to the namespace >>>> from which /dev/fuse is opened. >>>> >>>> - The namespace must be the same as s_user_ns. >>>> >>>> These restrictions simplify the implementation by avoiding the >>>> need to pass around userns references and by allowing fuse to >>>> rely on the checks in inode_change_ok for ownership changes. >>>> Either restriction could be relaxed in the future if needed. >>> >>> Can we not introduce potential userspace interface regressions? >>> >>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse: >>> allow server to run in different pid_ns") will probably bite us here >>> as well. >> >> Maybe, but unlike the pid namespace no one has been able to mount >> fuse outside of init_user_ns so we are much less exposed. I agree we >> should be careful. > > Have to wrap my head around all the rules here. > > There's the may_mount() one: > > ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) > > Um, first of all, why isn't it checking current->cred->user_ns? > > Ah, there it is in sget(): > > ns_capable(user_ns, CAP_SYS_ADMIN) > > I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs > doesn't have FS_USERNS_MOUNT. 
This is the one that prevents fuse > mounts from being created when (current->cred->user_ns != > &init_user_ns). > > Maybe there's a logic to this web of namespaces, but I don't yet see > it. Is it documented somewhere? I think this is a bit simpler than the fiddly details in the implementation might make it look. The fundamental idea is that permission to have full control over a mount namespace is different than permission to have full control over an instance of a filesystem. Implementing that separation of permission checks gets a little bit fiddly. The first challenge is that there are several filesystems like sysfs and proc whose internal mount is created outside of a process. Then there are the file systems like nfs and afs that have ``referral points'' that transition you to other instances of those filesystems when you transition over them. That is the reason why there are exceptions for SB_KERNMOUNT and SB_SUBMOUNT. may_mount is just the permission check for the mount namespace. It checks that the current process has CAP_SYS_ADMIN in the user namespace that owns the current mount namespace. In other words: is the process allowed to change the mount namespace? sget is just the permission check for mounting a filesystem. It checks that the mounter has CAP_SYS_ADMIN over the user namespace that will own the newly mounted filesystem. By the time execution gets to sget_userns, in general all of the permission checks have been made. But if the filesystem is not one that supports mounting within a user namespace the code checks capable(CAP_SYS_ADMIN). That is more convoluted than I would like but the checks derive from the definition of what we are doing.
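The checks described above can be modelled in userspace roughly as follows. This is a toy sketch, not kernel code: the structs and the *_model function names are invented for illustration, the SB_KERNMOUNT and SB_SUBMOUNT exceptions are left out, and the capability walk is simplified (it ignores the rule that the owner of a child user namespace gets capabilities over it).

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct user_ns {
    const struct user_ns *parent;       /* NULL for the initial user namespace */
};

struct task {
    const struct user_ns *cred_ns;      /* user namespace of the task's credentials */
    const struct user_ns *mnt_ns_owner; /* user namespace that owns the task's mount namespace */
    bool cap_sys_admin;                 /* CAP_SYS_ADMIN held inside cred_ns */
};

/*
 * Simplified ns_capable(target, CAP_SYS_ADMIN): walk from the target
 * namespace towards the root; the capability counts only if the walk
 * reaches the task's own user namespace and the task holds it there.
 */
static bool ns_capable_model(const struct task *t, const struct user_ns *target)
{
    for (const struct user_ns *ns = target; ns != NULL; ns = ns->parent)
        if (ns == t->cred_ns)
            return t->cap_sys_admin;
    return false;
}

/* may_mount(): is the task allowed to change its current mount namespace? */
bool may_mount_model(const struct task *t)
{
    return ns_capable_model(t, t->mnt_ns_owner);
}

/*
 * sget() + sget_userns(): is the task allowed to create a filesystem
 * instance that will be owned by its own user namespace?
 */
bool sget_allowed_model(const struct task *t, bool fs_userns_mount,
                        const struct user_ns *init_ns)
{
    if (!ns_capable_model(t, t->cred_ns))
        return false;                         /* the check in sget() */
    if (!fs_userns_mount)
        return ns_capable_model(t, init_ns);  /* plain capable() in sget_userns() */
    return true;
}

/* Both checks have to pass for a new mount to succeed. */
bool can_mount_model(const struct task *t, bool fs_userns_mount,
                     const struct user_ns *init_ns)
{
    return may_mount_model(t) && sget_allowed_model(t, fs_userns_mount, init_ns);
}
```

In this model a task that is root inside a child user namespace and has its own mount namespace can mount a filesystem with FS_USERNS_MOUNT set, but not one without it; the same task still sharing the initial mount namespace fails already at the may_mount step.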
> >>> We basically need two modes of operation: >>> >>> a) old, backward compatible (not introducing any new failure mores), >>> created with privileged mount >>> b) new, non-backward compatible, created with unprivileged mount >>> >>> Technically there would still be a risk from breaking userspace, since >>> we are using the same entry point for both, but let's hope that no >>> practical problems come from that. >> >> Answering from a 10,000 foot perspective: >> >> There are two cases. Requests to read/write the filesystem from outside >> of s_user_ns. These run no risk of breaking userspace as this mode has >> not been implemented before. > > This comes from the fact that (s_user_ns == &init_user_ns) and all > user namespaces are "inside" init_user_ns, right? Yes. > One question: why does current code use the from_[ug]id_munged() > variant, when the conversion can never fail. Or can it? There is always at least (uid_t)-1 that can fail if it shows up on a filesystem. As far as I can tell no one was using it for a uid, there were already uses of (uid_t)-1 as a special case, and I just grabbed it to become INVALID_UID. In practice the mapping can't fail unless someone malicious starts using that id. I believe I picked the _munged variant so in case that version hits we are guaranteed to return the 16bit nobody user. Eric From ebiederm at xmission.com Fri Feb 16 21:53:19 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Fri, 16 Feb 2018 15:53:19 -0600 Subject: [PATCH v5 00/11] FUSE mounts from non-init user namespaces In-Reply-To: (Miklos Szeredi's message of "Tue, 13 Feb 2018 12:32:09 +0100") References: Message-ID: <87y3jssisw.fsf@xmission.com> Miklos Szeredi writes: > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > >> Patches 1-2 deal with an additional flag of lookup_bdev() to check for >> additional inode permission. > > fuse_blk is less suitable for unprivileged mounting than plain fuse. 
> fusermount doesn't allow mounting fuse_blk unprivileged, so there's > little data about that usecase (IIRC ntfs3g guys did that, or at least > tried to do it, but I don't remember the details). > > As such, I think we should leave it out of the initial version. Which > means you can drop patches 1-2 from this series. Unless there's a > strong use case for this. In which case we should look hard at the > differences between fuse_blk and fuse and how that affects > unprivileged operation. There are a few assumptions about fuse_blk > filesystem being more "well behaved", I think. Especially to start with I am fine with that. It makes a lot of sense to get the obvious cases first. Eric From ebiederm at xmission.com Fri Feb 16 22:00:53 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Fri, 16 Feb 2018 16:00:53 -0600 Subject: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes In-Reply-To: (Miklos Szeredi's message of "Tue, 13 Feb 2018 14:18:21 +0100") References: Message-ID: <87a7w8siga.fsf@xmission.com> Miklos Szeredi writes: > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >> From: Eric W. Biederman >> >> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to >> chown files. Ordinarily the capable_wrt_inode_uidgid check is >> sufficient to allow access to files but when the underlying filesystem >> has uids or gids that don't map to the current user namespace it is >> not enough, so the chown permission checks need to be extended to >> allow this case. >> >> Calling chown on filesystem nodes whose uid or gid don't map is >> necessary if those nodes are going to be modified as writing back >> inodes which contain uids or gids that don't map is likely to cause >> filesystem corruption of the uid or gid fields. > > How can the filesystem be corrupted if chown is denied? > > It is not clear to me what the purpose of this patch is or what the > exact usecase this is fixing. 
It isn't a fix and we can delay this one and similar patches that enable things until we are certain all of the necessary restrictions are in place. This is not essential for safely getting fully unprivileged mounting of fuse to work. The overall strategy has been to handle as many of the generic concerns at the vfs level as possible to separate filesystem concerns and generic concerns. In this case the generic concern is what happens when the uid is read from the filesystem and it gets mapped to INVALID_UID and then the inode for that file is written back. That is a trap for the unwary filesystem implementation and not a case that I think anyone will actually care about. It is just not useful to mount a filesystem and to not map some of its ids. So the generic vfs code just denies writes to files that show up with a uid of INVALID_UID or a gid of INVALID_GID, just to ensure that problems don't show up. This patch gets through those defenses. Eric From sargun at sargun.me Sat Feb 17 06:29:32 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Fri, 16 Feb 2018 22:29:32 -0800 Subject: [PATCH net-next 1/3] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: References: <20180213154255.GA3301@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Tue, Feb 13, 2018 at 12:34 PM, Kees Cook wrote: > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon wrote: >> From: Sargun Dhillon >> >> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant >> to be used for seccomp filters as an alternative to cBPF filters. The >> program type has relatively limited capabilities in terms of helpers, >> but that can be extended later on. >> >> It also introduces a new mechanism to attach these filters via the >> prctl and seccomp syscalls -- SECCOMP_MODE_FILTER_EXTENDED, and >> SECCOMP_SET_MODE_FILTER_EXTENDED respectively.
>> >> Signed-off-by: Sargun Dhillon >> --- >> arch/Kconfig | 7 ++ >> include/linux/bpf_types.h | 3 + >> include/uapi/linux/bpf.h | 2 + >> include/uapi/linux/seccomp.h | 15 +++-- >> kernel/bpf/syscall.c | 1 + >> kernel/seccomp.c | 148 +++++++++++++++++++++++++++++++++++++------ >> 6 files changed, 150 insertions(+), 26 deletions(-) >> >> diff --git a/arch/Kconfig b/arch/Kconfig >> index 76c0b54443b1..db675888577c 100644 >> --- a/arch/Kconfig >> +++ b/arch/Kconfig >> @@ -401,6 +401,13 @@ config SECCOMP_FILTER >> >> See Documentation/prctl/seccomp_filter.txt for details. >> >> +config SECCOMP_FILTER_EXTENDED >> + bool "Extended BPF seccomp filters" >> + depends on SECCOMP_FILTER && BPF_SYSCALL >> + help >> + Enables seccomp filters to be written in eBPF, as opposed >> + to just cBPF filters. >> + >> config HAVE_GCC_PLUGINS >> bool >> help >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >> index 19b8349a3809..945c65c4e461 100644 >> --- a/include/linux/bpf_types.h >> +++ b/include/linux/bpf_types.h >> @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) >> #ifdef CONFIG_CGROUP_BPF >> BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) >> #endif >> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED >> +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) >> +#endif >> >> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index db6bdc375126..5f96cb7ed954 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -1,3 +1,4 @@ >> + >> /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ >> /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com >> * >> @@ -133,6 +134,7 @@ enum bpf_prog_type { >> BPF_PROG_TYPE_SOCK_OPS, >> BPF_PROG_TYPE_SK_SKB, >> BPF_PROG_TYPE_CGROUP_DEVICE, >> + BPF_PROG_TYPE_SECCOMP, >> }; >> >> enum bpf_attach_type { >> diff --git a/include/uapi/linux/seccomp.h 
b/include/uapi/linux/seccomp.h >> index 2a0bd9dd104d..7da8b39f2a6a 100644 >> --- a/include/uapi/linux/seccomp.h >> +++ b/include/uapi/linux/seccomp.h >> @@ -7,14 +7,17 @@ >> >> >> /* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, ) */ >> -#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ >> -#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ >> -#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ >> +#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ >> +#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ >> +#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ >> +#define SECCOMP_MODE_FILTER_EXTENDED 3 /* uses eBPF filter from fd */ > > This MODE flag isn't needed: it's still using a filter, and the > interface changes should be sufficient with > SECCOMP_SET_MODE_FILTER_EXTENDED below. > >> /* Valid operations for seccomp syscall. */ >> -#define SECCOMP_SET_MODE_STRICT 0 >> -#define SECCOMP_SET_MODE_FILTER 1 >> -#define SECCOMP_GET_ACTION_AVAIL 2 >> +#define SECCOMP_SET_MODE_STRICT 0 >> +#define SECCOMP_SET_MODE_FILTER 1 >> +#define SECCOMP_GET_ACTION_AVAIL 2 >> +#define SECCOMP_SET_MODE_FILTER_EXTENDED 3 > > It seems like this should be a FILTER flag, not a syscall op change? > >> + >> >> /* Valid flags for SECCOMP_SET_MODE_FILTER */ >> #define SECCOMP_FILTER_FLAG_TSYNC 1 >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index e24aa3241387..86d6ec8b916d 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) >> >> if (type != BPF_PROG_TYPE_SOCKET_FILTER && >> type != BPF_PROG_TYPE_CGROUP_SKB && >> + type != BPF_PROG_TYPE_SECCOMP && >> !capable(CAP_SYS_ADMIN)) >> return -EPERM; > > So only init_ns-CAP_SYS_ADMIN would be able to use seccomp eBPF? > No, this is specifically so non-init CAP_SYS_ADMIN can load BPF filters that are either socket_filter, cgroup_skb, or seccomp.
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index 940fa408a288..b30dd25c1cb8 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -37,6 +37,7 @@ >> #include >> #include >> #include >> +#include >> >> /** >> * struct seccomp_filter - container for seccomp BPF programs >> @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) >> >> BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); >> >> - /* >> - * Installing a seccomp filter requires that the task has >> - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. >> - * This avoids scenarios where unprivileged tasks can affect the >> - * behavior of privileged children. >> - */ >> - if (!task_no_new_privs(current) && >> - security_capable_noaudit(current_cred(), current_user_ns(), >> - CAP_SYS_ADMIN) != 0) >> - return ERR_PTR(-EACCES); >> - >> /* Allocate a new seccomp_filter */ >> sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); >> if (!sfilter) >> @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) >> return filter; >> } >> >> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED >> +/** >> + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd >> + * @user_filter: pointer to the user data containing an fd. >> + * >> + * Returns 0 on success and non-zero otherwise. 
>> + */ >> +static struct seccomp_filter * >> +seccomp_prepare_extended_filter(const char __user *user_fd) >> +{ >> + struct seccomp_filter *sfilter; >> + struct bpf_prog *fp; >> + int fd; >> + >> + /* Fetch the fd from userspace */ >> + if (get_user(fd, (int __user *)user_fd)) >> + return ERR_PTR(-EFAULT); >> + >> + /* Allocate a new seccomp_filter */ >> + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); >> + if (!sfilter) >> + return ERR_PTR(-ENOMEM); >> + >> + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); >> + if (IS_ERR(fp)) { >> + kfree(sfilter); >> + return ERR_CAST(fp); >> + } >> + >> + sfilter->prog = fp; >> + refcount_set(&sfilter->usage, 1); >> + >> + return sfilter; >> +} >> +#else >> +static struct seccomp_filter * >> +seccomp_prepare_extended_filter(const char __user *filter_fd) >> +{ >> + return ERR_PTR(-EINVAL); >> +} >> +#endif >> + >> /** >> * seccomp_attach_filter: validate and attach filter >> * @flags: flags to change filter behavior >> @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) >> static inline void seccomp_filter_free(struct seccomp_filter *filter) >> { >> if (filter) { >> - bpf_prog_destroy(filter->prog); >> + if (bpf_prog_was_classic(filter->prog)) >> + bpf_prog_destroy(filter->prog); >> + else >> + bpf_prog_put(filter->prog); >> kfree(filter); >> } >> } >> @@ -842,18 +877,36 @@ static long seccomp_set_mode_strict(void) >> * Returns 0 on success or -EINVAL on failure. >> */ >> static long seccomp_set_mode_filter(unsigned int flags, >> - const char __user *filter) >> + const char __user *filter, >> + unsigned long filter_type) > > I think this can just live in flags? > >> { >> - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; >> + /* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ >> + const unsigned long filter_mode = SECCOMP_MODE_FILTER; >> struct seccomp_filter *prepared = NULL; >> long ret = -EINVAL; >> >> /* Validate flags. 
*/ >> if (flags & ~SECCOMP_FILTER_FLAG_MASK) >> return -EINVAL; >> + /* >> + * Installing a seccomp filter requires that the task has >> + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. >> + * This avoids scenarios where unprivileged tasks can affect the >> + * behavior of privileged children. >> + */ >> + if (!task_no_new_privs(current) && >> + security_capable_noaudit(current_cred(), current_user_ns(), >> + CAP_SYS_ADMIN) != 0) >> + return -EACCES; > > This changes the order of checks -- before, too-large filters would > get EINVAL even if they lacked the needed permissions. As long as this > doesn't break anything in the real world, it should be fine, but I > might want to instead create a perm-check function and just call it in > both functions. (And likely write a self-test that checks this order, > if it doesn't already exist.) > >> >> /* Prepare the new filter before holding any locks. */ >> - prepared = seccomp_prepare_user_filter(filter); >> + if (filter_type == SECCOMP_SET_MODE_FILTER_EXTENDED) >> + prepared = seccomp_prepare_extended_filter(filter); >> + else if (filter_type == SECCOMP_SET_MODE_FILTER) >> + prepared = seccomp_prepare_user_filter(filter); >> + else >> + return -EINVAL; >> + >> if (IS_ERR(prepared)) >> return PTR_ERR(prepared); >> >> @@ -867,7 +920,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> >> spin_lock_irq(&current->sighand->siglock); >> >> - if (!seccomp_may_assign_mode(seccomp_mode)) >> + if (!seccomp_may_assign_mode(filter_mode)) >> goto out; >> >> ret = seccomp_attach_filter(flags, prepared); >> @@ -876,7 +929,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> /* Do not free the successfully attached filter. */ >> prepared = NULL; >> >> - seccomp_assign_mode(current, seccomp_mode); >> + seccomp_assign_mode(current, filter_mode); > > With a filter flag, the above hunks don't need to be changed, for example.
> >> out: >> spin_unlock_irq(&current->sighand->siglock); >> if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> @@ -926,7 +979,9 @@ static long do_seccomp(unsigned int op, unsigned int flags, >> return -EINVAL; >> return seccomp_set_mode_strict(); >> case SECCOMP_SET_MODE_FILTER: >> - return seccomp_set_mode_filter(flags, uargs); >> + return seccomp_set_mode_filter(flags, uargs, op); >> + case SECCOMP_SET_MODE_FILTER_EXTENDED: >> + return seccomp_set_mode_filter(flags, uargs, op); > > And this isn't needed, since it would be passed as a flag. > >> case SECCOMP_GET_ACTION_AVAIL: >> if (flags != 0) >> return -EINVAL; >> @@ -969,6 +1024,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) >> op = SECCOMP_SET_MODE_FILTER; >> uargs = filter; >> break; >> + case SECCOMP_MODE_FILTER_EXTENDED: >> + op = SECCOMP_SET_MODE_FILTER_EXTENDED; >> + uargs = filter; >> + break; > > Same. > >> default: >> return -EINVAL; >> } >> @@ -1040,8 +1099,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> if (IS_ERR(filter)) >> return PTR_ERR(filter); >> >> - fprog = filter->prog->orig_prog; >> - if (!fprog) { >> + if (!bpf_prog_was_classic(filter->prog)) { >> /* This must be a new non-cBPF filter, since we save >> * every cBPF filter's orig_prog above when >> * CONFIG_CHECKPOINT_RESTORE is enabled. >> @@ -1050,6 +1108,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> goto out; >> } >> >> + fprog = filter->prog->orig_prog; > > I wonder if it would be easier to review to split eBPF install from > the eBPF "get filter" changes as separate patches? > Yes, will respin. Thanks for your feedback. I appreciate the quick review.
>> ret = fprog->len; >> if (!data) >> goto out; >> @@ -1239,6 +1298,55 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, >> return 0; >> } >> >> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED >> +static bool seccomp_is_valid_access(int off, int size, >> + enum bpf_access_type type, >> + struct bpf_insn_access_aux *info) >> +{ >> + if (type != BPF_READ) >> + return false; >> + >> + if (off < 0 || off + size > sizeof(struct seccomp_data)) >> + return false; >> + >> + switch (off) { >> + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): >> + return (size == sizeof(__u64)); >> + case bpf_ctx_range(struct seccomp_data, nr): >> + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); >> + case bpf_ctx_range(struct seccomp_data, arch): >> + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); >> + case bpf_ctx_range(struct seccomp_data, instruction_pointer): >> + return (size == FIELD_SIZEOF(struct seccomp_data, >> + instruction_pointer)); >> + } >> + >> + return false; >> +} >> + >> +static const struct bpf_func_proto * >> +seccomp_func_proto(enum bpf_func_id func_id) >> +{ >> + switch (func_id) { >> + case BPF_FUNC_get_current_uid_gid: >> + return &bpf_get_current_uid_gid_proto; >> + case BPF_FUNC_trace_printk: >> + if (capable(CAP_SYS_ADMIN)) >> + return bpf_get_trace_printk_proto(); >> + default: >> + return NULL; >> + } >> +} > > This makes me so uncomfortable. :) Why is uid/gid needed? Why add > printk support here? (And why is it CAP_SYS_ADMIN checked if the > entire filter is CAP_SYS_ADMIN checked before being attached?) > See comment above. Anyone can load filters. You can load the filter as a normal user, drop privileges, and install the filter later with cap_sys_admin, or no_new_privs.
>> +
>> +const struct bpf_prog_ops seccomp_prog_ops = {
>> +};
>> +
>> +const struct bpf_verifier_ops seccomp_verifier_ops = {
>> + .get_func_proto = seccomp_func_proto,
>> + .is_valid_access = seccomp_is_valid_access,
>> +};
>> +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
>> +
>> static struct ctl_path seccomp_sysctl_path[] = {
>> { .procname = "kernel", },
>> { .procname = "seccomp", },
>> --
>> 2.14.1
>>
>
> -Kees
>
> --
> Kees Cook
> Pixel Security

From sargun at sargun.me Sat Feb 17 07:35:55 2018
From: sargun at sargun.me (Sargun Dhillon)
Date: Sat, 17 Feb 2018 07:35:55 +0000
Subject: [net-next v2 0/2] eBPF Seccomp filters
Message-ID: <20180217073550.GA8202@ircssh-2.c.rugged-nimbus-611.internal>

This patchset enables seccomp filters to be written in eBPF. Although this patchset doesn't introduce much of the functionality enabled by eBPF, it lays the groundwork for it. Currently, you have to disable CHECKPOINT_RESTORE support in order to utilize eBPF seccomp filters, as eBPF filters cannot be retrieved via the ptrace GET_FILTER API.

Any user can load a bpf seccomp filter program, and it can be pinned and reused without requiring access to the bpf syscalls. To install their rule, a user only requires the traditional permissions: either having cap_sys_admin or having no_new_privs set.

The primary reason for not adding maps support in this patchset is to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. If we have a map that the BPF program can read, it can potentially "change" privileges after running. It seems like doing writes only is safe, because it can be pure and side-effect free, and therefore does not negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come to an agreement, this can be in a follow-up patchset.
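As a rough illustration of the install-time permission rule described above (cap_sys_admin or no_new_privs), the following C sketch forks a child, optionally sets no_new_privs, and attaches a trivial allow-all classic BPF filter through the existing seccomp(2) interface. It deliberately uses cBPF rather than the eBPF attach flag proposed in this series, since that flag is only a proposal here; try_install_filter() and demo_install() are hypothetical names used for illustration and are not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, BPF_STMT */
#include <linux/seccomp.h>  /* SECCOMP_SET_MODE_FILTER, SECCOMP_RET_ALLOW */
#include <sys/prctl.h>      /* PR_SET_NO_NEW_PRIVS */
#include <sys/syscall.h>    /* __NR_seccomp */
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, optionally set no_new_privs in it,
 * then try to attach an allow-all cBPF filter. The filter is attached
 * only in the child, so the caller's own process is left unfiltered.
 *
 * Returns 0 if the filter was attached, the child's errno value if a
 * step failed, or -1 if fork()/waitpid() itself failed.
 */
int try_install_filter(int set_no_new_privs)
{
	struct sock_filter insns[] = {
		/* Single instruction: allow every syscall. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		if (set_no_new_privs && prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			_exit(errno);
		if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog))
			_exit(errno);
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* Usage sketch: with no_new_privs set, no capability should be needed. */
void demo_install(void)
{
	int ret = try_install_filter(1);

	/* Tolerate environments where seccomp itself is unavailable. */
	assert(ret == 0 || ret == EPERM || ret == ENOSYS || ret == EINVAL);
}
```

Calling try_install_filter(0) from a task that has neither CAP_SYS_ADMIN nor no_new_privs would be expected to report EACCES, which is the error returned by the permission check quoted in the patch.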
A benchmark of this patchset is as follows for a very standard eBPF filter:

Given this test program:
for (i = 10; i < 99999999; i++) syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs with a program per syscall, and tail call, the numbers are such:
ebpf JIT 12.3% slower than native
ebpf no JIT 13.6% slower than native
seccomp JIT 17.6% slower than native
seccomp no JIT 37% slower than native

The cost of the traditional seccomp filter grows O(n) with the number of syscalls with discrete rulesets, whereas ebpf is O(1), given any number of syscall filters.

Changes since v1:
* Use a flag to indicate loading an eBPF filter, not a separate command
* Remove printk helper
* Remove ptrace patch / restore filter / sample
* Add some safe helpers

Sargun Dhillon (2):
bpf, seccomp: Add eBPF filter capabilities
bpf: Add eBPF seccomp sample programs

arch/Kconfig | 8 +++
include/linux/bpf_types.h | 3 +
include/linux/seccomp.h | 3 +-
include/uapi/linux/bpf.h | 2 +
include/uapi/linux/seccomp.h | 7 ++-
kernel/bpf/syscall.c | 1 +
kernel/seccomp.c | 145 +++++++++++++++++++++++++++++++++++++------
samples/bpf/Makefile | 5 ++
samples/bpf/bpf_load.c | 9 ++-
samples/bpf/seccomp1_kern.c | 43 +++++++++++++
samples/bpf/seccomp1_user.c | 45 ++++++++++++++
11 files changed, 247 insertions(+), 24 deletions(-)
create mode 100644 samples/bpf/seccomp1_kern.c
create mode 100644 samples/bpf/seccomp1_user.c

--
2.14.1

From sargun at sargun.me Sat Feb 17 07:36:08 2018
From: sargun at sargun.me (Sargun Dhillon)
Date: Sat, 17 Feb 2018 07:36:08 +0000
Subject: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities
Message-ID: <20180217073604.GA8214@ircssh-2.c.rugged-nimbus-611.internal>

This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant to be used for seccomp filters as an alternative to cBPF filters. The program type has relatively limited capabilities in terms of helpers, but that can be extended later on.
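To make the program type a little more concrete, here is a userspace sketch (plain C, not kernel or BPF code) of the kind of decision such a filter computes: it reads fields of struct seccomp_data and returns a SECCOMP_RET_* action. The logic mirrors the seccomp1 sample posted with this series (fail close() on fd 999 with EPERM, and reject an unexpected architecture); seccomp_decision(), its expected_arch parameter, and demo_seccomp_decision() are names assumed here for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64, AUDIT_ARCH_I386 */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/syscall.h>    /* __NR_close */

/*
 * Userspace model of the sample filter's logic. A real filter would be
 * compiled to BPF and receive ctx from the kernel on each syscall; here
 * ctx is built by hand so the decision can be inspected directly.
 */
uint32_t seccomp_decision(const struct seccomp_data *ctx,
			  uint32_t expected_arch)
{
	/* Refuse to interpret syscall numbers from another architecture. */
	if (ctx->arch != expected_arch)
		return SECCOMP_RET_ERRNO | EPERM;

	/* Make close(999) fail with EPERM. */
	if (ctx->nr == __NR_close && ctx->args[0] == 999)
		return SECCOMP_RET_ERRNO | EPERM;

	return SECCOMP_RET_ALLOW;
}

/* Usage sketch: feed hand-built contexts through the decision function. */
void demo_seccomp_decision(void)
{
	struct seccomp_data d = {
		.nr = __NR_close,
		.arch = AUDIT_ARCH_X86_64,
		.args = { 999 },
	};

	/* close(999) on the expected architecture is rejected. */
	assert(seccomp_decision(&d, AUDIT_ARCH_X86_64) ==
	       (SECCOMP_RET_ERRNO | EPERM));

	/* Any other fd is allowed. */
	d.args[0] = 3;
	assert(seccomp_decision(&d, AUDIT_ARCH_X86_64) == SECCOMP_RET_ALLOW);

	/* A mismatched architecture is rejected regardless of the syscall. */
	d.arch = AUDIT_ARCH_I386;
	assert(seccomp_decision(&d, AUDIT_ARCH_X86_64) ==
	       (SECCOMP_RET_ERRNO | EPERM));
}
```

Passing the architecture as a parameter is a choice made for this sketch so the same code can be exercised on any host; the posted sample instead selects it at compile time with preprocessor checks.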
The eBPF code loading is separated from attachment of the filter, so a privileged user can load the filter, and pass it back to an unprivileged user who can attach it and use it at a later time. In order to attach the filter itself, you need to supply a flag to the seccomp syscall indicating that a eBPF filter is being attached, as opposed to a cBPF one. Verification occurs at program load time, so the user should only receive errors related to attachment. Signed-off-by: Sargun Dhillon --- arch/Kconfig | 8 +++ include/linux/bpf_types.h | 3 + include/linux/seccomp.h | 3 +- include/uapi/linux/bpf.h | 2 + include/uapi/linux/seccomp.h | 7 ++- kernel/bpf/syscall.c | 1 + kernel/seccomp.c | 145 +++++++++++++++++++++++++++++++++++++------ 7 files changed, 147 insertions(+), 22 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 76c0b54443b1..8490d35e59d6 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -401,6 +401,14 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_FILTER_EXTENDED + bool "Extended BPF seccomp filters" + depends on SECCOMP_FILTER && BPF_SYSCALL + depends on !CHECKPOINT_RESTORE + help + Enables seccomp filters to be written in eBPF, as opposed + to just cBPF filters. 
+ config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 19b8349a3809..945c65c4e461 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) #endif +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index c723a5c4e3ff..a7df3ba6cf25 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -5,7 +5,8 @@ #include #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG) + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_EXTENDED) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index db6bdc375126..5f96cb7ed954 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1,3 +1,4 @@ + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * @@ -133,6 +134,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_CGROUP_DEVICE, + BPF_PROG_TYPE_SECCOMP, }; enum bpf_attach_type { diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..730af6c7ec2e 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -16,10 +16,11 @@ #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 -/* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC 1 -#define SECCOMP_FILTER_FLAG_LOG 2 +/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_TSYNC (1 << 0) +#define SECCOMP_FILTER_FLAG_LOG (1 << 1) +#define SECCOMP_FILTER_FLAG_EXTENDED (1 << 2) /* * All BPF programs must 
return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e24aa3241387..86d6ec8b916d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && + type != BPF_PROG_TYPE_SECCOMP && !capable(CAP_SYS_ADMIN)) return -EPERM; diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 940fa408a288..f8ddc4af1135 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -37,6 +37,7 @@ #include #include #include +#include /** * struct seccomp_filter - container for seccomp BPF programs @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); - /* - * Installing a seccomp filter requires that the task has - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. - * This avoids scenarios where unprivileged tasks can affect the - * behavior of privileged children. - */ - if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) - return ERR_PTR(-EACCES); - /* Allocate a new seccomp_filter */ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); if (!sfilter) @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +/** + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd + * @user_filter: pointer to the user data containing an fd. + * + * Returns 0 on success and non-zero otherwise. 
+ */ +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *user_fd) +{ + struct seccomp_filter *sfilter; + struct bpf_prog *fp; + int fd; + + /* Fetch the fd from userspace */ + if (get_user(fd, (int __user *)user_fd)) + return ERR_PTR(-EFAULT); + + /* Allocate a new seccomp_filter */ + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); + if (!sfilter) + return ERR_PTR(-ENOMEM); + + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); + if (IS_ERR(fp)) { + kfree(sfilter); + return ERR_CAST(fp); + } + + sfilter->prog = fp; + refcount_set(&sfilter->usage, 1); + + return sfilter; +} +#else +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *filter_fd) +{ + return ERR_PTR(-EINVAL); +} +#endif + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) static inline void seccomp_filter_free(struct seccomp_filter *filter) { if (filter) { - bpf_prog_destroy(filter->prog); + if (bpf_prog_was_classic(filter->prog)) + bpf_prog_destroy(filter->prog); + else + bpf_prog_put(filter->prog); kfree(filter); } } @@ -844,7 +879,8 @@ static long seccomp_set_mode_strict(void) static long seccomp_set_mode_filter(unsigned int flags, const char __user *filter) { - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; + /* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ + const unsigned long filter_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; @@ -853,10 +889,31 @@ static long seccomp_set_mode_filter(unsigned int flags, return -EINVAL; /* Prepare the new filter before holding any locks. 
*/ - prepared = seccomp_prepare_user_filter(filter); + if (flags & SECCOMP_FILTER_FLAG_EXTENDED) + prepared = seccomp_prepare_extended_filter(filter); + else + prepared = seccomp_prepare_user_filter(filter); + if (IS_ERR(prepared)) return PTR_ERR(prepared); + /* + * Installing a seccomp filter requires that the task has + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. + * This avoids scenarios where unprivileged tasks can affect the + * behavior of privileged children. + * + * This is checked after filter preparation because the user + * will get an EINVAL if their filter is invalid prior to the + * EPERM. + */ + if (!task_no_new_privs(current) && + security_capable_noaudit(current_cred(), current_user_ns(), + CAP_SYS_ADMIN) != 0) { + ret = -EACCES; + goto out_free; + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_lock_irq(&current->sighand->siglock); - if (!seccomp_may_assign_mode(seccomp_mode)) + if (!seccomp_may_assign_mode(filter_mode)) goto out; ret = seccomp_attach_filter(flags, prepared); @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, /* Do not free the successfully attached filter. */ prepared = NULL; - seccomp_assign_mode(current, seccomp_mode); + seccomp_assign_mode(current, filter_mode); out: spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, if (IS_ERR(filter)) return PTR_ERR(filter); - fprog = filter->prog->orig_prog; - if (!fprog) { + if (!bpf_prog_was_classic(filter->prog)) { /* This must be a new non-cBPF filter, since we save * every cBPF filter's orig_prog above when * CONFIG_CHECKPOINT_RESTORE is enabled.
@@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, goto out; } + fprog = filter->prog->orig_prog; ret = fprog->len; if (!data) goto out; @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, return 0; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +static bool seccomp_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (type != BPF_READ) + return false; + + if (off < 0 || off + size > sizeof(struct seccomp_data)) + return false; + + switch (off) { + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): + return (size == sizeof(__u64)); + case bpf_ctx_range(struct seccomp_data, nr): + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); + case bpf_ctx_range(struct seccomp_data, arch): + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); + case bpf_ctx_range(struct seccomp_data, instruction_pointer): + return (size == FIELD_SIZEOF(struct seccomp_data, + instruction_pointer)); + } + + return false; +} + +static const struct bpf_func_proto * +seccomp_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_get_current_uid_gid: + return &bpf_get_current_uid_gid_proto; + case BPF_FUNC_ktime_get_ns: + return &bpf_ktime_get_ns_proto; + case BPF_FUNC_get_prandom_u32: + return &bpf_get_prandom_u32_proto; + case BPF_FUNC_get_current_pid_tgid: + return &bpf_get_current_pid_tgid_proto; + default: + return NULL; + } +} + +const struct bpf_prog_ops seccomp_prog_ops = { +}; + +const struct bpf_verifier_ops seccomp_verifier_ops = { + .get_func_proto = seccomp_func_proto, + .is_valid_access = seccomp_is_valid_access, +}; +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ + static struct ctl_path seccomp_sysctl_path[] = { { .procname = "kernel", }, { .procname = "seccomp", }, -- 2.14.1 From sargun at sargun.me Sat Feb 17 07:36:20 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Sat, 17 Feb 2018 
07:36:20 +0000 Subject: [net-next v2 2/2] bpf: Add eBPF seccomp sample programs Message-ID: <20180217073617.GA8226@ircssh-2.c.rugged-nimbus-611.internal> This adds a sample program that uses seccomp-eBPF, called seccomp1. It shows the simple ability to code seccomp filters in C. Signed-off-by: Sargun Dhillon --- samples/bpf/Makefile | 5 +++++ samples/bpf/bpf_load.c | 9 +++++++-- samples/bpf/seccomp1_kern.c | 43 +++++++++++++++++++++++++++++++++++++++++++ samples/bpf/seccomp1_user.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 100 insertions(+), 2 deletions(-) create mode 100644 samples/bpf/seccomp1_kern.c create mode 100644 samples/bpf/seccomp1_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ec3fc8d88e87..264838846f71 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -43,6 +43,7 @@ hostprogs-y += xdp_redirect_cpu hostprogs-y += xdp_monitor hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp +hostprogs-y += seccomp1 # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o @@ -93,6 +94,8 @@ xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o +seccomp1-objs := bpf_load.o $(LIBBPF) seccomp1_user.o + # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -144,6 +147,7 @@ always += xdp_monitor_kern.o always += xdp_rxq_info_kern.o always += xdp2skb_meta_kern.o always += syscall_tp_kern.o +always += seccomp1_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -188,6 +192,7 @@ HOSTLOADLIBES_xdp_redirect_cpu += -lelf HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_xdp_rxq_info += -lelf HOSTLOADLIBES_syscall_tp += -lelf +HOSTLOADLIBES_seccomp1 += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # 
make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 69806d74fa53..856bc8b93916 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -67,6 +67,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0; bool is_sockops = strncmp(event, "sockops", 7) == 0; bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0; + bool is_seccomp = strncmp(event, "seccomp", 7) == 0; size_t insns_cnt = size / sizeof(struct bpf_insn); enum bpf_prog_type prog_type; char buf[256]; @@ -96,6 +97,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_type = BPF_PROG_TYPE_SOCK_OPS; } else if (is_sk_skb) { prog_type = BPF_PROG_TYPE_SK_SKB; + } else if (is_seccomp) { + prog_type = BPF_PROG_TYPE_SECCOMP; } else { printf("Unknown event '%s'\n", event); return -1; @@ -110,7 +113,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_fd[prog_cnt++] = fd; - if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk) + if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk || + is_seccomp) return 0; if (is_socket || is_sockops || is_sk_skb) { @@ -589,7 +593,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map) memcmp(shname, "socket", 6) == 0 || memcmp(shname, "cgroup/", 7) == 0 || memcmp(shname, "sockops", 7) == 0 || - memcmp(shname, "sk_skb", 6) == 0) { + memcmp(shname, "sk_skb", 6) == 0 || + memcmp(shname, "seccomp", 7) == 0) { ret = load_and_attach(shname, data->d_buf, data->d_size); if (ret != 0) diff --git a/samples/bpf/seccomp1_kern.c b/samples/bpf/seccomp1_kern.c new file mode 100644 index 000000000000..420e37eebd92 --- /dev/null +++ b/samples/bpf/seccomp1_kern.c @@ -0,0 +1,43 @@ +#include +#include +#include +#include "bpf_helpers.h" +#include +#include + +#if defined(__x86_64__) +#define ARCH AUDIT_ARCH_X86_64 
+#elif defined(__i386__) +#define ARCH AUDIT_ARCH_I386 +#else +#endif + +#ifdef ARCH +/* Returns EPERM when trying to close fd 999 */ +SEC("seccomp") +int bpf_prog1(struct seccomp_data *ctx) +{ + /* + * Make sure this BPF program is being run on the same architecture it + * was compiled on. + */ + if (ctx->arch != ARCH) + return SECCOMP_RET_ERRNO | EPERM; + if (ctx->nr == __NR_close && ctx->args[0] == 999) + return SECCOMP_RET_ERRNO | EPERM; + + return SECCOMP_RET_ALLOW; +} +#else +#warning Architecture not supported -- Blocking all syscalls +SEC("seccomp") +int bpf_prog1(struct seccomp_data *ctx) +{ + return SECCOMP_RET_ERRNO | EPERM; +} +#endif + + + + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/seccomp1_user.c b/samples/bpf/seccomp1_user.c new file mode 100644 index 000000000000..b4951e0ca56f --- /dev/null +++ b/samples/bpf/seccomp1_user.c @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include "libbpf.h" +#include "bpf_load.h" +#include +#include +#include +#include +#include +#include + +int main(int argc, char **argv) +{ + char filename[256]; + + + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(filename)) { + printf("%s", bpf_log_buf); + return 1; + } + + /* set new_new_privs so non-privileged users can attach filters */ + if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) { + perror("prctl(NO_NEW_PRIVS)"); + return 1; + } + + if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, + SECCOMP_FILTER_FLAG_EXTENDED, &prog_fd)) { + perror("seccomp"); + return 1; + } + + close(111); + assert(errno == EBADF); + close(999); + assert(errno = EPERM); + + return 0; +} -- 2.14.1 From rdunlap at infradead.org Sat Feb 17 17:58:16 2018 From: rdunlap at infradead.org (Randy Dunlap) Date: Sat, 17 Feb 2018 09:58:16 -0800 Subject: [net-next v2 2/2] bpf: Add eBPF seccomp sample programs In-Reply-To: <20180217073617.GA8226@ircssh-2.c.rugged-nimbus-611.internal> References: 
<20180217073617.GA8226@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On 02/16/2018 11:36 PM, Sargun Dhillon wrote: > + close(111); > + assert(errno == EBADF); > + close(999); > + assert(errno = EPERM); should that be == ? > + > + return 0; > +} -- ~Randy From sargun at sargun.me Mon Feb 19 04:05:35 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Sun, 18 Feb 2018 20:05:35 -0800 Subject: [net-next v2 2/2] bpf: Add eBPF seccomp sample programs In-Reply-To: References: <20180217073617.GA8226@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: On Sat, Feb 17, 2018 at 9:58 AM, Randy Dunlap wrote: > On 02/16/2018 11:36 PM, Sargun Dhillon wrote: >> + close(111); >> + assert(errno == EBADF); >> + close(999); >> + assert(errno = EPERM); > > should that be == ? > Woops. Embarassing. Will fix that in the next re-spin. >> + >> + return 0; >> +} > > > -- > ~Randy
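A note on the assertion flagged in the seccomp1_user.c review: `assert(errno = EPERM)` assigns EPERM to errno and then tests the assigned value, which is non-zero, so the assertion can never fire no matter what close(999) actually did. Below is a minimal sketch of a stricter check. The helper name close_fails_with() is hypothetical and is not part of the posted sample; it also looks at the return value first, on the assumption that errno should only be trusted after a call has reported failure.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the posted patch).
 * Returns 0 if close(fd) failed with exactly `expected`, -1 otherwise.
 * errno is cleared first and only inspected after close() returns -1.
 */
static int close_fails_with(int fd, int expected)
{
	errno = 0;
	if (close(fd) != -1)
		return -1;	/* the call succeeded, so nothing was blocked */
	return errno == expected ? 0 : -1;
}
```

With the filter attached, the sample's last check could then presumably be written as assert(close_fails_with(999, EPERM) == 0), which uses a comparison rather than an assignment.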
From sargun at sargun.me Mon Feb 19 16:22:02 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 19 Feb 2018 16:22:02 +0000 Subject: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities Message-ID: <20180219162159.GA11474@ircssh-2.c.rugged-nimbus-611.internal> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant to be used for seccomp filters as an alternative to cBPF filters. The program type has relatively limited capabilities in terms of helpers, but that can be extended later on. The eBPF code loading is separated from attachment of the filter, so a privileged user can load the filter, and pass it back to an unprivileged user who can attach it and use it at a later time. In order to attach the filter itself, you need to supply a flag to the seccomp syscall indicating that a eBPF filter is being attached, as opposed to a cBPF one. Verification occurs at program load time, so the user should only receive errors related to attachment. Signed-off-by: Sargun Dhillon --- arch/Kconfig | 8 +++ include/linux/bpf_types.h | 3 + include/linux/seccomp.h | 3 +- include/uapi/linux/bpf.h | 2 + include/uapi/linux/seccomp.h | 7 ++- kernel/bpf/syscall.c | 1 + kernel/seccomp.c | 145 +++++++++++++++++++++++++++++++++++++------ 7 files changed, 147 insertions(+), 22 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 76c0b54443b1..8490d35e59d6 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -401,6 +401,14 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_FILTER_EXTENDED + bool "Extended BPF seccomp filters" + depends on SECCOMP_FILTER && BPF_SYSCALL + depends on !CHECKPOINT_RESTORE + help + Enables seccomp filters to be written in eBPF, as opposed + to just cBPF filters. 
+ config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 19b8349a3809..945c65c4e461 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) #endif +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index c723a5c4e3ff..a7df3ba6cf25 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -5,7 +5,8 @@ #include #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG) + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_EXTENDED) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index db6bdc375126..5f96cb7ed954 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1,3 +1,4 @@ + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * @@ -133,6 +134,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_CGROUP_DEVICE, + BPF_PROG_TYPE_SECCOMP, }; enum bpf_attach_type { diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..730af6c7ec2e 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -16,10 +16,11 @@ #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 -/* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC 1 -#define SECCOMP_FILTER_FLAG_LOG 2 +/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_TSYNC (1 << 0) +#define SECCOMP_FILTER_FLAG_LOG (1 << 1) +#define SECCOMP_FILTER_FLAG_EXTENDED (1 << 2) /* * All BPF programs must 
return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e24aa3241387..86d6ec8b916d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && + type != BPF_PROG_TYPE_SECCOMP && !capable(CAP_SYS_ADMIN)) return -EPERM; diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 940fa408a288..f8ddc4af1135 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -37,6 +37,7 @@ #include #include #include +#include /** * struct seccomp_filter - container for seccomp BPF programs @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); - /* - * Installing a seccomp filter requires that the task has - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. - * This avoids scenarios where unprivileged tasks can affect the - * behavior of privileged children. - */ - if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) - return ERR_PTR(-EACCES); - /* Allocate a new seccomp_filter */ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); if (!sfilter) @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +/** + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd + * @user_filter: pointer to the user data containing an fd. + * + * Returns 0 on success and non-zero otherwise. 
+ */ +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *user_fd) +{ + struct seccomp_filter *sfilter; + struct bpf_prog *fp; + int fd; + + /* Fetch the fd from userspace */ + if (get_user(fd, (int __user *)user_fd)) + return ERR_PTR(-EFAULT); + + /* Allocate a new seccomp_filter */ + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); + if (!sfilter) + return ERR_PTR(-ENOMEM); + + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); + if (IS_ERR(fp)) { + kfree(sfilter); + return ERR_CAST(fp); + } + + sfilter->prog = fp; + refcount_set(&sfilter->usage, 1); + + return sfilter; +} +#else +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *filter_fd) +{ + return ERR_PTR(-EINVAL); +} +#endif + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) static inline void seccomp_filter_free(struct seccomp_filter *filter) { if (filter) { - bpf_prog_destroy(filter->prog); + if (bpf_prog_was_classic(filter->prog)) + bpf_prog_destroy(filter->prog); + else + bpf_prog_put(filter->prog); kfree(filter); } } @@ -844,7 +879,8 @@ static long seccomp_set_mode_strict(void) static long seccomp_set_mode_filter(unsigned int flags, const char __user *filter) { - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; + /* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ + const unsigned long filter_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; @@ -853,10 +889,31 @@ static long seccomp_set_mode_filter(unsigned int flags, return -EINVAL; /* Prepare the new filter before holding any locks. 
*/ - prepared = seccomp_prepare_user_filter(filter); + if (flags & SECCOMP_FILTER_FLAG_EXTENDED) + prepared = seccomp_prepare_extended_filter(filter); + else + prepared = seccomp_prepare_user_filter(filter); + if (IS_ERR(prepared)) return PTR_ERR(prepared); + /* + * Installing a seccomp filter requires that the task has + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. + * This avoids scenarios where unprivileged tasks can affect the + * behavior of privileged children. + * + * This is checked after filter preparation because the user + * will get an EINVAL if their filter is invalid prior to the + * EPERM. + */ + if (!task_no_new_privs(current) && + security_capable_noaudit(current_cred(), current_user_ns(), + CAP_SYS_ADMIN) != 0) { + ret = -EACCES; + goto out_free; + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_lock_irq(&current->sighand->siglock); - if (!seccomp_may_assign_mode(seccomp_mode)) + if (!seccomp_may_assign_mode(filter_mode)) goto out; ret = seccomp_attach_filter(flags, prepared); @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, /* Do not free the successfully attached filter. */ prepared = NULL; - seccomp_assign_mode(current, seccomp_mode); + seccomp_assign_mode(current, filter_mode); out: spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, if (IS_ERR(filter)) return PTR_ERR(filter); - fprog = filter->prog->orig_prog; - if (!fprog) { + if (!bpf_prog_was_classic(filter->prog)) { /* This must be a new non-cBPF filter, since we save * every cBPF filter's orig_prog above when * CONFIG_CHECKPOINT_RESTORE is enabled.
@@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, goto out; } + fprog = filter->prog->orig_prog; ret = fprog->len; if (!data) goto out; @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, return 0; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +static bool seccomp_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (type != BPF_READ) + return false; + + if (off < 0 || off + size > sizeof(struct seccomp_data)) + return false; + + switch (off) { + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): + return (size == sizeof(__u64)); + case bpf_ctx_range(struct seccomp_data, nr): + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); + case bpf_ctx_range(struct seccomp_data, arch): + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); + case bpf_ctx_range(struct seccomp_data, instruction_pointer): + return (size == FIELD_SIZEOF(struct seccomp_data, + instruction_pointer)); + } + + return false; +} + +static const struct bpf_func_proto * +seccomp_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_get_current_uid_gid: + return &bpf_get_current_uid_gid_proto; + case BPF_FUNC_ktime_get_ns: + return &bpf_ktime_get_ns_proto; + case BPF_FUNC_get_prandom_u32: + return &bpf_get_prandom_u32_proto; + case BPF_FUNC_get_current_pid_tgid: + return &bpf_get_current_pid_tgid_proto; + default: + return NULL; + } +} + +const struct bpf_prog_ops seccomp_prog_ops = { +}; + +const struct bpf_verifier_ops seccomp_verifier_ops = { + .get_func_proto = seccomp_func_proto, + .is_valid_access = seccomp_is_valid_access, +}; +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ + static struct ctl_path seccomp_sysctl_path[] = { { .procname = "kernel", }, { .procname = "seccomp", }, -- 2.14.1
From ebiederm at xmission.com Mon Feb 19 22:56:56 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 19 Feb 2018 16:56:56 -0600 Subject: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems In-Reply-To: (Miklos Szeredi's message of "Wed, 14 Feb 2018 13:28:12 +0100") References: <61a37f0b159dd56825696d8d3beb8eaffdf1f72f.1512041070.git.dongsu@kinvolk.io> Message-ID: <87mv04mvuv.fsf@xmission.com> Miklos Szeredi writes: > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: >> From: Seth Forshee >> >> The user in control of a super block should be allowed to freeze >> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW >> ioctls to require CAP_SYS_ADMIN in s_user_ns. > > Why is this required for unprivileged fuse? > > Fuse doesn't support freeze, so this seems to make no sense in the > context of this patchset. This isn't required to support fuse.
It is a relaxation in permissions so it isn't strictly necessary for anything. Until just recently Seth and I were working through the vfs looking at what we need in general for unprivileged mounts. Fuse was our focus, but we were not limiting ourselves to fuse. I have been putting off relaxations of permissions like this because they are not necessary for safety. But in general they do make sense. In practice I think all we need to worry about for fuse is the last 4 patches. Eric From ebiederm at xmission.com Mon Feb 19 23:09:51 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 19 Feb 2018 17:09:51 -0600 Subject: [PATCH v5 00/11] FUSE mounts from non-init user namespaces In-Reply-To: (Alban Crequy's message of "Thu, 18 Jan 2018 15:58:41 +0100") References: <877etbcmnd.fsf@xmission.com> Message-ID: <87inaslgow.fsf@xmission.com> Alban Crequy writes: > Hi Eric, > > Do you have some cycles for this now that it is the new year? > > A review on the associated ima issue would also be appreciated: > https://www.mail-archive.com/linux-kernel at vger.kernel.org/msg1587678.html It has taken me longer than I expected but I do have time now. I am moving through these patches and issues a little slowly, but I do intend to get us through the fuse issues this development cycle if at all possible. I think for starters we should restrict ourselves to the last 4 patches aka (8, 9, 10, 11). In particular we should concentrate on [8/11] fuse: Support fuse filesystems outside of init_user_ns [9/11] fuse: Restrict allow_other to the superblock's namespace or a descendant The tricky issues are handled in the vfs, and I think the remaining tricky issues are evm and ima, which are close enough to being resolved that we can count them as resolved. Once we have 8 & 9 reviewed and merged we can double check there isn't some silly reason not to set FS_USERNS_MOUNT on fuse and then enable it.
I would like to double check and ensure there are not silly issues with posix acls or anything else in the vfs. But I think except for a silly oversight we are good. I should probably also add a patch that adds to Documentation/filesystems that explains what the vfs does for unprivileged mounts. So that I can point people working on filesystems and are thinking about enabling user namespace mounts at the documentation for what the vfs does. That would also provide a good checklist to ensure the way the vfs handles things is sufficient for fuse. As for the earlier patches that enable things. Overall they are good. They are slightly dangerous as they enable more code paths to unprivileged users. But mostly I think they are not immediately necessary and as such a distraction to getting this code in. That said once we get the fuse bits reviewed merged I will be more than happy to merge the relaxation of permission checks that we can perform now that s_user_ns exists. Eric From ebiederm at xmission.com Mon Feb 19 23:16:59 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 19 Feb 2018 17:16:59 -0600 Subject: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant In-Reply-To: (Dongsu Park's message of "Fri, 22 Dec 2017 15:32:33 +0100") References: Message-ID: <87d110lgd0.fsf@xmission.com> Dongsu Park writes: > From: Seth Forshee > > Unprivileged users are normally restricted from mounting with the > allow_other option by system policy, but this could be bypassed > for a mount done with user namespace root permissions. In such > cases allow_other should not allow users outside the userns > to access the mount as doing so would give the unprivileged user > the ability to manipulate processes it would otherwise be unable > to manipulate. Restrict allow_other to apply to users in the same > userns used at mount or a descendant of that namespace. Also > export current_in_userns() for use by fuse when built as a > module. 
> > Patch v4 is available: https://patchwork.kernel.org/patch/8944671/ > > Cc: linux-fsdevel at vger.kernel.org > Cc: linux-kernel at vger.kernel.org > Cc: "Eric W. Biederman" > Cc: Serge Hallyn > Cc: Miklos Szeredi > Signed-off-by: Seth Forshee > Signed-off-by: Dongsu Park Reviewed-by: "Eric W. Biederman" > --- > fs/fuse/dir.c | 2 +- > kernel/user_namespace.c | 1 + > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index ad1cfac1..d41559a0 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc) > const struct cred *cred; > > if (fc->allow_other) > - return 1; > + return current_in_userns(fc->user_ns); > > cred = current_cred(); > if (uid_eq(cred->euid, fc->user_id) && > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 246d4d4c..492c255e 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns) > { > return in_userns(target_ns, current_user_ns()); > } > +EXPORT_SYMBOL(current_in_userns); > > static inline struct user_namespace *to_user_ns(struct ns_common *ns) > { From daniel at iogearbox.net Tue Feb 20 00:00:42 2018 From: daniel at iogearbox.net (Daniel Borkmann) Date: Tue, 20 Feb 2018 01:00:42 +0100 Subject: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <20180219162159.GA11474@ircssh-2.c.rugged-nimbus-611.internal> References: <20180219162159.GA11474@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: <37135c70-bb09-c4ac-e81d-dc161724292b@iogearbox.net> On 02/19/2018 05:22 PM, Sargun Dhillon wrote: > This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant > to be used for seccomp filters as an alternative to cBPF filters. The > program type has relatively limited capabilities in terms of helpers, > but that can be extended later on. 
> > The eBPF code loading is separated from attachment of the filter, so > a privileged user can load the filter, and pass it back to an > unprivileged user who can attach it and use it at a later time. > > In order to attach the filter itself, you need to supply a flag to the > seccomp syscall indicating that a eBPF filter is being attached, as > opposed to a cBPF one. Verification occurs at program load time, > so the user should only receive errors related to attachment. > > Signed-off-by: Sargun Dhillon [...] > @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, > > spin_lock_irq(&current->sighand->siglock); > > - if (!seccomp_may_assign_mode(seccomp_mode)) > + if (!seccomp_may_assign_mode(filter_mode)) > goto out; > > ret = seccomp_attach_filter(flags, prepared); > @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, > /* Do not free the successfully attached filter. */ > prepared = NULL; > > - seccomp_assign_mode(current, seccomp_mode); > + seccomp_assign_mode(current, filter_mode); > out: > spin_unlock_irq(&current->sighand->siglock); > if (flags & SECCOMP_FILTER_FLAG_TSYNC) > @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, > if (IS_ERR(filter)) > return PTR_ERR(filter); > > - fprog = filter->prog->orig_prog; > - if (!fprog) { > + if (!bpf_prog_was_classic(filter->prog)) { This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for dumping seccomp filters") and would cause a NULL ptr deref in case the filter was created with bpf_prog_create_from_user() with save_orig as false, so the if (!fprog) test for cBPF cannot be removed from here. > /* This must be a new non-cBPF filter, since we save > * every cBPF filter's orig_prog above when > * CONFIG_CHECKPOINT_RESTORE is enabled.
> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, > goto out; > } > > + fprog = filter->prog->orig_prog; > ret = fprog->len; (See above.) > if (!data) > goto out; > @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, > return 0; > } > > +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED > +static bool seccomp_is_valid_access(int off, int size, > + enum bpf_access_type type, > + struct bpf_insn_access_aux *info) > +{ > + if (type != BPF_READ) > + return false; > + > + if (off < 0 || off + size > sizeof(struct seccomp_data)) > + return false; if (off % size != 0) return false; > + switch (off) { > + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): > + return (size == sizeof(__u64)); > + case bpf_ctx_range(struct seccomp_data, nr): > + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); > + case bpf_ctx_range(struct seccomp_data, arch): > + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); > + case bpf_ctx_range(struct seccomp_data, instruction_pointer): > + return (size == FIELD_SIZEOF(struct seccomp_data, > + instruction_pointer)); default: return false; [...] > +static const struct bpf_func_proto * > +seccomp_func_proto(enum bpf_func_id func_id) > +{ > + switch (func_id) { > + case BPF_FUNC_get_current_uid_gid: > + return &bpf_get_current_uid_gid_proto; > + case BPF_FUNC_ktime_get_ns: > + return &bpf_ktime_get_ns_proto; > + case BPF_FUNC_get_prandom_u32: > + return &bpf_get_prandom_u32_proto; > + case BPF_FUNC_get_current_pid_tgid: > + return &bpf_get_current_pid_tgid_proto; Do you have a use-case description for the above helpers? Is the prandom/ktime one for simulating errors coming from the syscall? And the other two for orchestration purposes? 
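To make the two review suggestions on seccomp_is_valid_access() concrete (the `off % size` alignment test and an explicit reject for anything that matches no field), here is a small userspace model of the check. This is only a sketch under stated assumptions: struct seccomp_data_model and model_is_valid_access() are hypothetical names, the struct is assumed to mirror the uapi struct seccomp_data layout, and the BPF_READ test is left out. In this model, an 8-byte load at offset 12 would fall inside the instruction_pointer range and pass the size test, and it is the alignment test that rejects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed mirror of the uapi struct seccomp_data (hypothetical name). */
struct seccomp_data_model {
	int32_t  nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Userspace model of the context access check; not kernel code. */
static bool model_is_valid_access(int off, int size)
{
	if (size <= 0 || off < 0 ||
	    (size_t)off + (size_t)size > sizeof(struct seccomp_data_model))
		return false;

	/* Reject unaligned loads, as suggested in the review. */
	if (off % size != 0)
		return false;

	/* Each field may only be read at its full width. */
	if ((size_t)off >= offsetof(struct seccomp_data_model, args))
		return size == (int)sizeof(uint64_t);
	if ((size_t)off >= offsetof(struct seccomp_data_model, instruction_pointer))
		return size == (int)sizeof(uint64_t);
	if ((size_t)off >= offsetof(struct seccomp_data_model, arch))
		return size == (int)sizeof(uint32_t);
	return size == (int)sizeof(int32_t);	/* nr */
}
```

The model uses an if-chain instead of the kernel's case ranges so that every offset ends in an explicit decision, which is the same effect the suggested `default: return false;` is after.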
One use case this work could enable would be to implement state machines in BPF for BPF-seccomp and enabling a more fine-grained / tiny subset of syscalls based on the state the prog is in while the rest is all blocked out - as opposed to a global white/black-list of syscalls the app can do in general. Getting to such an app model would probably be rather challenging at least for complex apps. We'd need some sort of scratch buffer for keeping the state for this though, e.g. either map with single slot or per thread scratch space. Anyway, just a thought. > + default: > + return NULL; > + } > +} > + > +const struct bpf_prog_ops seccomp_prog_ops = { > +}; > + > +const struct bpf_verifier_ops seccomp_verifier_ops = { > + .get_func_proto = seccomp_func_proto, > + .is_valid_access = seccomp_is_valid_access, > +}; > +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ > + > static struct ctl_path seccomp_sysctl_path[] = { > { .procname = "kernel", }, > { .procname = "seccomp", }, > From ebiederm at xmission.com Tue Feb 20 02:12:42 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 19 Feb 2018 20:12:42 -0600 Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: (Dongsu Park's message of "Fri, 22 Dec 2017 15:32:32 +0100") References: Message-ID: <877er8if39.fsf@xmission.com> Dongsu Park writes: > From: Seth Forshee > > In order to support mounts from namespaces other than > init_user_ns, fuse must translate uids and gids to/from the > userns of the process servicing requests on /dev/fuse. This > patch does that, with a couple of restrictions on the namespace: > > - The userns for the fuse connection is fixed to the namespace > from which /dev/fuse is opened. > > - The namespace must be the same as s_user_ns. > > These restrictions simplify the implementation by avoiding the > need to pass around userns references and by allowing fuse to > rely on the checks in inode_change_ok for ownership changes. 
> Either restriction could be relaxed in the future if needed. > > For cuse the namespace used for the connection is also simply > current_user_ns() at the time /dev/cuse is opened. > > Patch v4 is available: https://patchwork.kernel.org/patch/8944661/ > > Cc: linux-fsdevel at vger.kernel.org > Cc: linux-kernel at vger.kernel.org > Cc: Miklos Szeredi > Signed-off-by: Seth Forshee > Signed-off-by: Dongsu Park > --- > fs/fuse/cuse.c | 3 ++- > fs/fuse/dev.c | 11 ++++++++--- > fs/fuse/dir.c | 14 +++++++------- > fs/fuse/fuse_i.h | 6 +++++- > fs/fuse/inode.c | 31 +++++++++++++++++++------------ > 5 files changed, 41 insertions(+), 24 deletions(-) > > diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c > index e9e97803..b1b83259 100644 > --- a/fs/fuse/cuse.c > +++ b/fs/fuse/cuse.c > @@ -48,6 +48,7 @@ > #include > #include > #include > +#include > > #include "fuse_i.h" > > @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file) > if (!cc) > return -ENOMEM; > As noticed in the review this should probably say: if (current_user_ns() != &init_user_ns) return -EINVAL; Just so we don't need to think about cuse being opened in a user namespace at this point. It is probably harmless. But it isn't what we are focusing on. 
> - fuse_conn_init(&cc->fc); > + fuse_conn_init(&cc->fc, current_user_ns()); > > fud = fuse_dev_alloc(&cc->fc); > if (!fud) { > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index 17f0d05b..0f780e16 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req) > > static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) > { > - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); > - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); > + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); > + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); > req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); > } > > @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, > __set_bit(FR_WAITING, &req->flags); > if (for_background) > __set_bit(FR_BACKGROUND, &req->flags); > + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) { > + fuse_put_request(fc, req); > + return ERR_PTR(-EOVERFLOW); > + } > > return req; > > @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, > in = &req->in; > reqsize = in->h.len; > > - if (task_active_pid_ns(current) != fc->pid_ns) { > + if (task_active_pid_ns(current) != fc->pid_ns || > + current_user_ns() != fc->user_ns) { > rcu_read_lock(); > in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); > rcu_read_unlock(); The hunk above is a rebase error. I believe it started out by erroring out in the same case the pid namespace case errored out. Miklos has a good point that we need to handle the case where we have servers running in jails of one sort or another because at least sandstorm runs applications in that fashion, and we have previously had error reports about that configuration breaking. I think we can easily fix that. 
Either by adding extra translation as we did for the pid namespace or changing the user namespace used on the connection. I believe extra translation like we did with the pid namespace will be more consistent. And again it won't be a special case except possibly during mount. Of course there is weirdness there. Eric From sargun at sargun.me Wed Feb 21 07:16:20 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 20 Feb 2018 23:16:20 -0800 Subject: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <37135c70-bb09-c4ac-e81d-dc161724292b@iogearbox.net> References: <20180219162159.GA11474@ircssh-2.c.rugged-nimbus-611.internal> <37135c70-bb09-c4ac-e81d-dc161724292b@iogearbox.net> Message-ID: On Mon, Feb 19, 2018 at 4:00 PM, Daniel Borkmann wrote: > On 02/19/2018 05:22 PM, Sargun Dhillon wrote: >> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant >> to be used for seccomp filters as an alternative to cBPF filters. The >> program type has relatively limited capabilities in terms of helpers, >> but that can be extended later on. >> >> The eBPF code loading is separated from attachment of the filter, so >> a privileged user can load the filter, and pass it back to an >> unprivileged user who can attach it and use it at a later time. >> >> In order to attach the filter itself, you need to supply a flag to the >> seccomp syscall indicating that a eBPF filter is being attached, as >> opposed to a cBPF one. Verification occurs at program load time, >> so the user should only receive errors related to attachment. >> >> Signed-off-by: Sargun Dhillon > [...] 
>> @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> >> spin_lock_irq(&current->sighand->siglock); >> >> - if (!seccomp_may_assign_mode(seccomp_mode)) >> + if (!seccomp_may_assign_mode(filter_mode)) >> goto out; >> >> ret = seccomp_attach_filter(flags, prepared); >> @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> /* Do not free the successfully attached filter. */ >> prepared = NULL; >> >> - seccomp_assign_mode(current, seccomp_mode); >> + seccomp_assign_mode(current, filter_mode); >> out: >> spin_unlock_irq(&current->sighand->siglock); >> if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> if (IS_ERR(filter)) >> return PTR_ERR(filter); >> >> - fprog = filter->prog->orig_prog; >> - if (!fprog) { >> + if (!bpf_prog_was_classic(filter->prog)) { > > This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for > dumping seccomp filters") and would cause a NULL ptr deref in case the filter > was created with bpf_prog_create_from_user() with save_orig as false, so the > if (!fprog) test for cBPF cannot be removed from here. > >> /* This must be a new non-cBPF filter, since we save >> * every cBPF filter's orig_prog above when >> * CONFIG_CHECKPOINT_RESTORE is enabled. >> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> goto out; >> } >> >> + fprog = filter->prog->orig_prog; >> ret = fprog->len; > > (See above.)
> >> if (!data) >> goto out; >> @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, >> return 0; >> } >> >> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED >> +static bool seccomp_is_valid_access(int off, int size, >> + enum bpf_access_type type, >> + struct bpf_insn_access_aux *info) >> +{ >> + if (type != BPF_READ) >> + return false; >> + >> + if (off < 0 || off + size > sizeof(struct seccomp_data)) >> + return false; > > if (off % size != 0) > return false; > Won't this break access to the instruction pointer, and args if sizeof(int) != 4? I don't know if any architectures fall under that. >> + switch (off) { >> + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): >> + return (size == sizeof(__u64)); >> + case bpf_ctx_range(struct seccomp_data, nr): >> + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); >> + case bpf_ctx_range(struct seccomp_data, arch): >> + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); >> + case bpf_ctx_range(struct seccomp_data, instruction_pointer): >> + return (size == FIELD_SIZEOF(struct seccomp_data, >> + instruction_pointer)); > > default: > return false; > > [...] >> +static const struct bpf_func_proto * >> +seccomp_func_proto(enum bpf_func_id func_id) >> +{ >> + switch (func_id) { >> + case BPF_FUNC_get_current_uid_gid: >> + return &bpf_get_current_uid_gid_proto; >> + case BPF_FUNC_ktime_get_ns: >> + return &bpf_ktime_get_ns_proto; >> + case BPF_FUNC_get_prandom_u32: >> + return &bpf_get_prandom_u32_proto; >> + case BPF_FUNC_get_current_pid_tgid: >> + return &bpf_get_current_pid_tgid_proto; > > Do you have a use-case description for the above helpers? Is the prandom/ktime > one for simulating errors coming from the syscall? And the other two for > orchestration purposes?
> My specific use case with uid_gid and pid is for containers. I have a use case where I can put systemd, or another privileged init system, into a container; pid 1, running as uid 0, will get access to whatever is needed in order to wire up an init system. If the user forks, or does a setuid / setgid to another level, the access is lost, and they become unprivileged. Depending on the container, different levels of access are needed by the init, so seccomp-ebpf is a bit better here as compared to, say, apparmor. prandom is for testing. ktime is for testing and to limit access after some time period occurs. Example: In the first 30 seconds of the container's lifetime, it has privileges to wire up a file system, but this is then shut down. It's good for 3rd party software, until we have a map mechanism where you can hook a probe in to see once the program has initialized, and then you can revoke access to these things. > One use case this work could enable would be to implement state machines in BPF > for BPF-seccomp and enabling a more fine-grained / tiny subset of syscalls based > on the state the prog is in while the rest is all blocked out - as opposed to a > global white/black-list of syscalls the app can do in general. Getting to such > an app model would probably be rather challenging at least for complex apps. We'd > need some sort of scratch buffer for keeping the state for this though, e.g. either > map with single slot or per thread scratch space. Anyway, just a thought. > Yeah, are you thinking a per task space? I'm simply thinking "after you notice init is completed for PID x, go ahead and revoke access" -- and either stash this in an LRU map, or something more clever.
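As a rough illustration of the "privileged init for the first 30 seconds" idea described above, the decision could be modelled in plain userspace C as below. This is only a sketch, not the eBPF program from the patch: may_use_setup_syscalls(), SETUP_WINDOW_NS and the init_tgid parameter are hypothetical names. The packed inputs follow the layout the bpf_get_current_uid_gid and bpf_get_current_pid_tgid helpers return (uid in the low 32 bits of one value, tgid in the high 32 bits of the other), and the timestamps stand in for bpf_ktime_get_ns readings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical setup window taken from the example in the mail: 30 s in ns. */
#define SETUP_WINDOW_NS (30ULL * 1000000000ULL)

/*
 * Userspace model of the policy, not kernel code.
 *   uid_gid:   gid in the high 32 bits, uid in the low 32 bits
 *   pid_tgid:  tgid in the high 32 bits, thread id in the low 32 bits
 *   init_tgid: tgid of the container's init as seen by the filter (assumed
 *              to be known to whoever generates the filter)
 *   start_ns / now_ns: monotonic timestamps in nanoseconds
 */
static bool may_use_setup_syscalls(uint64_t uid_gid, uint64_t pid_tgid,
                                   uint32_t init_tgid,
                                   uint64_t start_ns, uint64_t now_ns)
{
    uint32_t uid = (uint32_t)(uid_gid & 0xffffffffULL);
    uint32_t tgid = (uint32_t)(pid_tgid >> 32);

    /* A monotonic clock never runs backwards. */
    assert(now_ns >= start_ns);

    /* Only the init process, still running as uid 0, is privileged. */
    if (tgid != init_tgid || uid != 0)
        return false;

    /* The privilege expires once the setup window has elapsed. */
    return (now_ns - start_ns) < SETUP_WINDOW_NS;
}
```

A forked child (different tgid) or a process that has switched to another uid falls through to the unprivileged path, which matches the "access is lost" behaviour described in the reply.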
>> + default: >> + return NULL; >> + } >> +} >> + >> +const struct bpf_prog_ops seccomp_prog_ops = { >> +}; >> + >> +const struct bpf_verifier_ops seccomp_verifier_ops = { >> + .get_func_proto = seccomp_func_proto, >> + .is_valid_access = seccomp_is_valid_access, >> +}; >> +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ >> + >> static struct ctl_path seccomp_sysctl_path[] = { >> { .procname = "kernel", }, >> { .procname = "seccomp", }, >> > From sargun at sargun.me Wed Feb 21 07:31:13 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Tue, 20 Feb 2018 23:31:13 -0800 Subject: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <37135c70-bb09-c4ac-e81d-dc161724292b@iogearbox.net> References: <20180219162159.GA11474@ircssh-2.c.rugged-nimbus-611.internal> <37135c70-bb09-c4ac-e81d-dc161724292b@iogearbox.net> Message-ID: On Mon, Feb 19, 2018 at 4:00 PM, Daniel Borkmann wrote: > On 02/19/2018 05:22 PM, Sargun Dhillon wrote: >> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant >> to be used for seccomp filters as an alternative to cBPF filters. The >> program type has relatively limited capabilities in terms of helpers, >> but that can be extended later on. >> >> The eBPF code loading is separated from attachment of the filter, so >> a privileged user can load the filter, and pass it back to an >> unprivileged user who can attach it and use it at a later time. >> >> In order to attach the filter itself, you need to supply a flag to the >> seccomp syscall indicating that a eBPF filter is being attached, as >> opposed to a cBPF one. Verification occurs at program load time, >> so the user should only receive errors related to attachment. >> >> Signed-off-by: Sargun Dhillon > [...] 
>> @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> >> spin_lock_irq(&current->sighand->siglock); >> >> - if (!seccomp_may_assign_mode(seccomp_mode)) >> + if (!seccomp_may_assign_mode(filter_mode)) >> goto out; >> >> ret = seccomp_attach_filter(flags, prepared); >> @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, >> /* Do not free the successfully attached filter. */ >> prepared = NULL; >> >> - seccomp_assign_mode(current, seccomp_mode); >> + seccomp_assign_mode(current, filter_mode); >> out: >> spin_unlock_irq(&current->sighand->siglock); >> if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> if (IS_ERR(filter)) >> return PTR_ERR(filter); >> >> - fprog = filter->prog->orig_prog; >> - if (!fprog) { >> + if (!bpf_prog_was_classic(filter->prog)) { > > This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for > dumping seccomp filters") and would cause a NULL ptr deref in case the filter > was created with bpf_prog_create_from_user() with save_orig as false, so the > if (!fprog) test for cBPF cannot be removed from here. > Isn't this function within: #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE) #endif And, above, where bpf_prog_create_from_user is, save_orig is derived from: const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); Are there any other places this can be loaded, or this function can be exposed if CONFIG_CHECKPOINT_RESTORE = n? >> /* This must be a new non-cBPF filter, since we save >> * every cBPF filter's orig_prog above when >> * CONFIG_CHECKPOINT_RESTORE is enabled. >> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, >> goto out; >> } >> >> + fprog = filter->prog->orig_prog; >> ret = fprog->len; > > (See above.)
> >> if (!data) >> goto out; >> @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, >> return 0; >> } >> >> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED >> +static bool seccomp_is_valid_access(int off, int size, >> + enum bpf_access_type type, >> + struct bpf_insn_access_aux *info) >> +{ >> + if (type != BPF_READ) >> + return false; >> + >> + if (off < 0 || off + size > sizeof(struct seccomp_data)) >> + return false; > > if (off % size != 0) > return false; > >> + switch (off) { >> + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): >> + return (size == sizeof(__u64)); >> + case bpf_ctx_range(struct seccomp_data, nr): >> + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); >> + case bpf_ctx_range(struct seccomp_data, arch): >> + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); >> + case bpf_ctx_range(struct seccomp_data, instruction_pointer): >> + return (size == FIELD_SIZEOF(struct seccomp_data, >> + instruction_pointer)); > > default: > return false; > > [...] >> +static const struct bpf_func_proto * >> +seccomp_func_proto(enum bpf_func_id func_id) >> +{ >> + switch (func_id) { >> + case BPF_FUNC_get_current_uid_gid: >> + return &bpf_get_current_uid_gid_proto; >> + case BPF_FUNC_ktime_get_ns: >> + return &bpf_ktime_get_ns_proto; >> + case BPF_FUNC_get_prandom_u32: >> + return &bpf_get_prandom_u32_proto; >> + case BPF_FUNC_get_current_pid_tgid: >> + return &bpf_get_current_pid_tgid_proto; > > Do you have a use-case description for the above helpers? Is the prandom/ktime > one for simulating errors coming from the syscall? And the other two for > orchestration purposes? > > One use case this work could enable would be to implement state machines in BPF > for BPF-seccomp and enabling a more fine-grained / tiny subset of syscalls based > on the state the prog is in while the rest is all blocked out - as opposed to a > global white/black-list of syscalls the app can do in general. 
Getting to such > an app model would probably be rather challenging at least for complex apps. We'd > need some sort of scratch buffer for keeping the state for this though, e.g. either > map with single slot or per thread scratch space. Anyway, just a thought. > >> + default: >> + return NULL; >> + } >> +} >> + >> +const struct bpf_prog_ops seccomp_prog_ops = { >> +}; >> + >> +const struct bpf_verifier_ops seccomp_verifier_ops = { >> + .get_func_proto = seccomp_func_proto, >> + .is_valid_access = seccomp_is_valid_access, >> +}; >> +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ >> + >> static struct ctl_path seccomp_sysctl_path[] = { >> { .procname = "kernel", }, >> { .procname = "seccomp", }, >> > From ebiederm at xmission.com Wed Feb 21 20:24:30 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 21 Feb 2018 14:24:30 -0600 Subject: [PATCH v6 0/6] fuse: mounts from non-init user namespaces In-Reply-To: (Dongsu Park's message of "Fri, 22 Dec 2017 15:32:24 +0100") References: Message-ID: <878tbmf5vl.fsf@xmission.com> This patchset builds on the work by Dongsu Park and Seth Forshee and is reduced to the set of patches that just affect fuse. The non-fuse patches are far enough along that we can ignore them, except possibly for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type. Fuse with a block device has been left as an exercise for a later time. I had to change the core of this patchset around some as the previous patches were showing signs of bitrot. Some important explanations were missing, some important functionality was missing, and xattr handling was completely absent. Miklos, can you take a look and see what you think? I think this much of the fuse changes is ready, and as such I would like to get them in this development cycle if possible. My apologies if I have lost someone's ack or review somewhere. Let me know and I will fix it.
These changes are also available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v6 Eric W. Biederman (4): fuse: Remove the buggy retranslation of pids in fuse_dev_do_read fuse: Fail all requests with invalid uids or gids fuse: Support fuse filesystems outside of init_user_ns fuse: Ensure posix acls are translated outside of init_user_ns Seth Forshee (1): fuse: Restrict allow_other to the superblock's namespace or a descendant fs/fuse/acl.c | 4 ++-- fs/fuse/cuse.c | 7 ++++++- fs/fuse/dev.c | 26 +++++++++++++------------- fs/fuse/dir.c | 16 ++++++++-------- fs/fuse/fuse_i.h | 7 ++++++- fs/fuse/inode.c | 38 ++++++++++++++++++++++++++------------ fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++ kernel/user_namespace.c | 1 + 8 files changed, 105 insertions(+), 37 deletions(-) Eric From ebiederm at xmission.com Wed Feb 21 20:29:04 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 21 Feb 2018 14:29:04 -0600 Subject: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read In-Reply-To: <878tbmf5vl.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> Message-ID: <20180221202908.17258-1-ebiederm@xmission.com> At the point of fuse_dev_do_read the user space process that initiated the action on the fuse filesystem may no longer exist. The process may have been killed or may have fired an asynchronous request and exited. If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns))" will either return a pid of 0, or in the unlikely event that the pid has been reallocated it can return practically any pid. Any pid is possible as the pid allocator allocates pid numbers in different pid namespaces independently. The only way to make translation in fuse_dev_do_read reliable is to call get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in fuse_dev_do_read.
That reference counting in other contexts has been shown to bounce cache lines between processors and in general be slow. So that is not desirable. The only known user of running the fuse server in a different pid namespace from the filesystem does not care what the pids are in the fuse messages so removing this code should not matter. Getting the translation to a server running outside of the pid namespace of a container can still be achieved by playing setns games at mount time. It is also possible to add an option to pass a pid namespace into the fuse filesystem at mount time. Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns") Signed-off-by: "Eric W. Biederman" --- fs/fuse/dev.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 5d06384c2cae..0fb58f364fa6 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, in = &req->in; reqsize = in->h.len; - if (task_active_pid_ns(current) != fc->pid_ns) { - rcu_read_lock(); - in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); - rcu_read_unlock(); - } - /* If request is too large, reply with an error and restart the read */ if (nbytes < reqsize) { req->out.h.error = -EIO; -- 2.14.1 From ebiederm at xmission.com Wed Feb 21 20:29:05 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 21 Feb 2018 14:29:05 -0600 Subject: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids In-Reply-To: <878tbmf5vl.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> Message-ID: <20180221202908.17258-2-ebiederm@xmission.com> Upon a cursory examination the uid and gid of a fuse request are necessary for correct operation. Failing a fuse request where those values are not reliable seems a straightforward and reliable means of ensuring that fuse requests with bad data are not sent or processed.
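The "fail the request when the ids are not representable" rule can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: struct id_extent, model_from_kid() and model_req_ids_valid() are hypothetical names, and a single extent table is used for both uids and gids for brevity (the kernel keeps separate uid and gid maps). What it does mirror is that an id with no mapping in the target namespace translates to (uid_t)-1, and a request carrying such an id is rejected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel for "no mapping", the same value as (uid_t)-1 / (gid_t)-1. */
#define INVALID_ID ((uint32_t)-1)

/* Hypothetical model of one id-map extent: ids [first, first + count) inside
 * the namespace correspond to [lower_first, lower_first + count) outside. */
struct id_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Translate an outside ("kernel") id into the namespace, or INVALID_ID. */
static uint32_t model_from_kid(const struct id_extent *map, size_t n,
                               uint32_t kid)
{
    for (size_t i = 0; i < n; i++) {
        if (kid >= map[i].lower_first &&
            kid - map[i].lower_first < map[i].count)
            return map[i].first + (kid - map[i].lower_first);
    }
    return INVALID_ID;
}

/* A request is acceptable only if both ids are representable. */
static bool model_req_ids_valid(const struct id_extent *map, size_t n,
                                uint32_t kuid, uint32_t kgid)
{
    return model_from_kid(map, n, kuid) != INVALID_ID &&
           model_from_kid(map, n, kgid) != INVALID_ID;
}
```

With a map of { first = 0, lower_first = 100000, count = 65536 }, outside id 101000 becomes 1000 inside the namespace, while outside id 0 has no mapping, so a request from that user would be refused rather than sent with a munged id.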
In most cases the vfs will avoid actions it suspects will cause an inode write back of an inode with an invalid uid or gid. But that does not map precisely to what fuse is doing, so test for this and solve this at the fuse level as well. Performing this work in fuse_req_init_context is cheap as the code is already performing the translation here and only needs to check the result of the translation to see if things are not representable in a form the fuse server can handle. Signed-off-by: Eric W. Biederman --- fs/fuse/dev.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 0fb58f364fa6..216db3f51a31 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req) refcount_dec(&req->count); } -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); + + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); } void fuse_set_initialized(struct fuse_conn *fc) @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, wake_up(&fc->blocked_waitq); goto out; } - - fuse_req_init_context(fc, req); __set_bit(FR_WAITING, &req->flags); if (for_background) __set_bit(FR_BACKGROUND, &req->flags); - + if (unlikely(!fuse_req_init_context(fc, req))) { + fuse_put_request(fc, req); + return ERR_PTR(-EOVERFLOW); + } return req; out: @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc, if (!req) req = get_reserved_req(fc, file); - fuse_req_init_context(fc, 
req); __set_bit(FR_WAITING, &req->flags); __clear_bit(FR_BACKGROUND, &req->flags); + if (unlikely(!fuse_req_init_context(fc, req))) { + fuse_put_request(fc, req); + return ERR_PTR(-EOVERFLOW); + } return req; } -- 2.14.1 From ebiederm at xmission.com Wed Feb 21 20:29:06 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 21 Feb 2018 14:29:06 -0600 Subject: [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: <878tbmf5vl.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> Message-ID: <20180221202908.17258-3-ebiederm@xmission.com> In order to support mounts from namespaces other than init_user_ns, fuse must translate uids and gids to/from the userns of the process servicing requests on /dev/fuse. This patch does that, with a couple of restrictions on the namespace: - The userns for the fuse connection is fixed to the namespace from which /dev/fuse is opened. - The namespace must be the same as s_user_ns. These restrictions simplify the implementation by avoiding the need to pass around userns references and by allowing fuse to rely on the checks in setattr_prepare for ownership changes. Either restriction could be relaxed in the future if needed. For cuse the userns used is the opener of /dev/cuse. Semantically the cuse support does not appear safe for unprivileged users. Practically the permissions on /dev/cuse only make it accessible to the global root user. If something slips through the cracks in a user namespace the only users who will be able to use the cuse device are those users mapped into the user namespace. Translation in the posix acl is updated to use the user namespace of the filesystem. Avoiding cases which might bypass this translation is handled in a following change. This change is strongly based on a similar change from Seth Forshee and Dongsu Park. Cc: linux-fsdevel at vger.kernel.org Cc: linux-kernel at vger.kernel.org Cc: Miklos Szeredi Cc: Cc: Dongsu Park Signed-off-by: Eric W.
Biederman --- fs/fuse/acl.c | 4 ++-- fs/fuse/cuse.c | 7 ++++++- fs/fuse/dev.c | 4 ++-- fs/fuse/dir.c | 14 +++++++------- fs/fuse/fuse_i.h | 6 +++++- fs/fuse/inode.c | 31 +++++++++++++++++++------------ 6 files changed, 41 insertions(+), 25 deletions(-) diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index ec85765502f1..5a48cee6d7d3 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type) return ERR_PTR(-ENOMEM); size = fuse_getxattr(inode, name, value, PAGE_SIZE); if (size > 0) - acl = posix_acl_from_xattr(&init_user_ns, value, size); + acl = posix_acl_from_xattr(fc->user_ns, value, size); else if ((size == 0) || (size == -ENODATA) || (size == -EOPNOTSUPP && fc->no_getxattr)) acl = NULL; @@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) if (!value) return -ENOMEM; - ret = posix_acl_to_xattr(&init_user_ns, acl, value, size); + ret = posix_acl_to_xattr(fc->user_ns, acl, value, size); if (ret < 0) { kfree(value); return ret; diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c index e9e97803442a..036ee477669e 100644 --- a/fs/fuse/cuse.c +++ b/fs/fuse/cuse.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "fuse_i.h" @@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file) if (!cc) return -ENOMEM; - fuse_conn_init(&cc->fc); + /* + * Limit the cuse channel to requests that can + * be represented in file->f_cred->user_ns. 
+ */ + fuse_conn_init(&cc->fc, file->f_cred->user_ns); fud = fuse_dev_alloc(&cc->fc); if (!fud) { diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 216db3f51a31..338cfda3eb8f 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req) static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { - req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); - req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 24967382a7b1..ad1cfac1942f 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, stat->ino = attr->ino; stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); stat->nlink = attr->nlink; - stat->uid = make_kuid(&init_user_ns, attr->uid); - stat->gid = make_kgid(&init_user_ns, attr->gid); + stat->uid = make_kuid(fc->user_ns, attr->uid); + stat->gid = make_kgid(fc->user_ns, attr->gid); stat->rdev = inode->i_rdev; stat->atime.tv_sec = attr->atime; stat->atime.tv_nsec = attr->atimensec; @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, - bool trust_local_cmtime) +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, + struct fuse_setattr_in *arg, bool trust_local_cmtime) { unsigned ivalid = iattr->ia_valid; if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; if (ivalid & ATTR_UID) - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid); + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, 
iattr->ia_uid); if (ivalid & ATTR_GID) - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid); + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid); if (ivalid & ATTR_SIZE) arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size; if (ivalid & ATTR_ATIME) { @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(attr, &inarg, trust_local_cmtime); + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index c4c093bbf456..7772e2b4057e 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -26,6 +26,7 @@ #include #include #include +#include /** Max number of pages that can be used in a single read request */ #define FUSE_MAX_PAGES_PER_REQ 32 @@ -466,6 +467,9 @@ struct fuse_conn { /** The pid namespace for this mount */ struct pid_namespace *pid_ns; + /** The user namespace for this mount */ + struct user_namespace *user_ns; + /** Maximum read size */ unsigned max_read; @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc); /** * Initialize fuse_conn */ -void fuse_conn_init(struct fuse_conn *fc); +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns); /** * Release reference to fuse_conn diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 624f18bbfd2b..e018dc3999f4 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, inode->i_ino = fuse_squash_ino(attr->ino); inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); set_nlink(inode, attr->nlink); - inode->i_uid = make_kuid(&init_user_ns, attr->uid); - inode->i_gid = make_kgid(&init_user_ns, attr->gid); + inode->i_uid = make_kuid(fc->user_ns, attr->uid); + inode->i_gid = make_kgid(fc->user_ns, 
attr->gid); inode->i_blocks = attr->blocks; inode->i_atime.tv_sec = attr->atime; inode->i_atime.tv_nsec = attr->atimensec; @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res) return err; } -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev, + struct user_namespace *user_ns) { char *p; memset(d, 0, sizeof(struct fuse_mount_data)); @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) case OPT_USER_ID: if (fuse_match_uint(&args[0], &uv)) return 0; - d->user_id = make_kuid(current_user_ns(), uv); + d->user_id = make_kuid(user_ns, uv); if (!uid_valid(d->user_id)) return 0; d->user_id_present = 1; @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) case OPT_GROUP_ID: if (fuse_match_uint(&args[0], &uv)) return 0; - d->group_id = make_kgid(current_user_ns(), uv); + d->group_id = make_kgid(user_ns, uv); if (!gid_valid(d->group_id)) return 0; d->group_id_present = 1; @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) struct super_block *sb = root->d_sb; struct fuse_conn *fc = get_fuse_conn_super(sb); - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id)); - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id)); + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id)); + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id)); if (fc->default_permissions) seq_puts(m, ",default_permissions"); if (fc->allow_other) @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq) fpq->connected = 1; } -void fuse_conn_init(struct fuse_conn *fc) +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns) { memset(fc, 0, sizeof(*fc)); spin_lock_init(&fc->lock); @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc) 
fc->attr_version = 1; get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); fc->pid_ns = get_pid_ns(task_active_pid_ns(current)); + fc->user_ns = get_user_ns(user_ns); } EXPORT_SYMBOL_GPL(fuse_conn_init); @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc) if (fc->destroy_req) fuse_request_free(fc->destroy_req); put_pid_ns(fc->pid_ns); + put_user_ns(fc->user_ns); fc->release(fc); } } @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION); - if (!parse_fuse_opt(data, &d, is_bdev)) + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns)) goto err; if (is_bdev) { @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) if (!file) goto err; - if ((file->f_op != &fuse_dev_operations) || - (file->f_cred->user_ns != &init_user_ns)) + /* + * Require mount to happen from the same user namespace which + * opened /dev/fuse to prevent potential attacks. + */ + if (file->f_op != &fuse_dev_operations || + file->f_cred->user_ns != sb->s_user_ns) goto err_fput; fc = kmalloc(sizeof(*fc), GFP_KERNEL); @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) if (!fc) goto err_fput; - fuse_conn_init(fc); + fuse_conn_init(fc, sb->s_user_ns); fc->release = fuse_free_conn; fud = fuse_dev_alloc(fc); -- 2.14.1 From ebiederm at xmission.com Wed Feb 21 20:29:07 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 21 Feb 2018 14:29:07 -0600 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: <878tbmf5vl.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> Message-ID: <20180221202908.17258-4-ebiederm@xmission.com> Ensure the translation happens by failing to read or write posix acls when the filesystem has not indicated it supports posix acls. This ensures that modern cached posix acl support is available and used when dealing with posix acls. 
This is important because only that path has the code to convert the uids and gids in posix acls into the user namespace of a fuse filesystem. Signed-off-by: "Eric W. Biederman" --- fs/fuse/fuse_i.h | 1 + fs/fuse/inode.c | 7 +++++++ fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 51 insertions(+) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7772e2b4057e..986fa2b043ab 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); int fuse_removexattr(struct inode *inode, const char *name); extern const struct xattr_handler *fuse_xattr_handlers[]; extern const struct xattr_handler *fuse_acl_xattr_handlers[]; +extern const struct xattr_handler *fuse_no_acl_xattr_handlers[]; struct posix_acl; struct posix_acl *fuse_get_acl(struct inode *inode, int type); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index e018dc3999f4..a52cf2019a58 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) file->f_cred->user_ns != sb->s_user_ns) goto err_fput; + /* + * If we are not in the initial user namespace posix + * acls must be translated.
+ */ + if (sb->s_user_ns != &init_user_ns) + sb->s_xattr = fuse_no_acl_xattr_handlers; + fc = kmalloc(sizeof(*fc), GFP_KERNEL); err = -ENOMEM; if (!fc) diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c index 3caac46b08b0..433717640f78 100644 --- a/fs/fuse/xattr.c +++ b/fs/fuse/xattr.c @@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler, return fuse_setxattr(inode, name, value, size, flags); } +static bool no_xattr_list(struct dentry *dentry) +{ + return false; +} + +static int no_xattr_get(const struct xattr_handler *handler, + struct dentry *dentry, struct inode *inode, + const char *name, void *value, size_t size) +{ + return -EOPNOTSUPP; +} + +static int no_xattr_set(const struct xattr_handler *handler, + struct dentry *dentry, struct inode *nodee, + const char *name, const void *value, + size_t size, int flags) +{ + return -EOPNOTSUPP; +} + static const struct xattr_handler fuse_xattr_handler = { .prefix = "", .get = fuse_xattr_get, @@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = { &fuse_xattr_handler, NULL }; + +static const struct xattr_handler fuse_no_acl_access_xattr_handler = { + .name = XATTR_NAME_POSIX_ACL_ACCESS, + .flags = ACL_TYPE_ACCESS, + .list = no_xattr_list, + .get = no_xattr_get, + .set = no_xattr_set, +}; + +static const struct xattr_handler fuse_no_acl_default_xattr_handler = { + .name = XATTR_NAME_POSIX_ACL_DEFAULT, + .flags = ACL_TYPE_ACCESS, + .list = no_xattr_list, + .get = no_xattr_get, + .set = no_xattr_set, +}; + +const struct xattr_handler *fuse_no_acl_xattr_handlers[] = { + &fuse_no_acl_access_xattr_handler, + &fuse_no_acl_default_xattr_handler, + &fuse_xattr_handler, + NULL +}; -- 2.14.1 From ebiederm at xmission.com Wed Feb 21 20:29:08 2018 From: ebiederm at xmission.com (Eric W. 
Biederman) Date: Wed, 21 Feb 2018 14:29:08 -0600 Subject: [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant In-Reply-To: <878tbmf5vl.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> Message-ID: <20180221202908.17258-5-ebiederm@xmission.com> From: Seth Forshee Unprivileged users are normally restricted from mounting with the allow_other option by system policy, but this could be bypassed for a mount done with user namespace root permissions. In such cases allow_other should not allow users outside the userns to access the mount as doing so would give the unprivileged user the ability to manipulate processes it would otherwise be unable to manipulate. Restrict allow_other to apply to users in the same userns used at mount or a descendant of that namespace. Also export current_in_userns() for use by fuse when built as a module. Cc: linux-fsdevel at vger.kernel.org Cc: linux-kernel at vger.kernel.org Cc: "Eric W. Biederman" Cc: Serge Hallyn Cc: Miklos Szeredi Acked-by: Miklos Szeredi Reviewed-by: Serge Hallyn Reviewed-by: "Eric W. Biederman" Signed-off-by: Seth Forshee Signed-off-by: Dongsu Park Signed-off-by: Eric W. 
Biederman --- fs/fuse/dir.c | 2 +- kernel/user_namespace.c | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index ad1cfac1942f..d41559a0aa6b 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc) const struct cred *cred; if (fc->allow_other) - return 1; + return current_in_userns(fc->user_ns); cred = current_cred(); if (uid_eq(cred->euid, fc->user_id) && diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 246d4d4ce5c7..492c255e6c5a 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns) { return in_userns(target_ns, current_user_ns()); } +EXPORT_SYMBOL(current_in_userns); static inline struct user_namespace *to_user_ns(struct ns_common *ns) { -- 2.14.1 From mszeredi at redhat.com Thu Feb 22 10:13:36 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Thu, 22 Feb 2018 11:13:36 +0100 Subject: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read In-Reply-To: <20180221202908.17258-1-ebiederm@xmission.com> References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-1-ebiederm@xmission.com> Message-ID: On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman wrote: > At the point of fuse_dev_do_read the user space process that initiated the > action on the fuse filesystem may no longer exist. The process have been > killed or may have fired an asynchronous request and exited. > > If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid, > fc->pid_ns)" will either return a pid of 0, or in the unlikely event that > the pid has been reallocated it can return practically any pid. Any pid is > possible as the pid allocator allocates pid numbers in different pid > namespaces independently. 
> > The only way to make translation in fuse_dev_do_read reliable is to call > get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in > fuse_dev_do_read. That reference counting in other contexts has been shown > to bounce cache lines between processors and in general be slow. So that is > not desirable. > > The only known user of running the fuse server in a different pid namespace > from the filesystem does not care what the pids are in the fuse messages > so removing this code should not matter. Shouldn't we at least zero out the pid in that case? Thanks, Miklos > > Getting the translation to a server running outside of the pid namespace > of a container can still be achieved by playing setns games at mount time. > It is also possible to add an option to pass a pid namespace into the fuse > filesystem at mount time. > > Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns") > Signed-off-by: "Eric W. Biederman" > --- > fs/fuse/dev.c | 6 ------ > 1 file changed, 6 deletions(-) > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index 5d06384c2cae..0fb58f364fa6 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, > in = &req->in; > reqsize = in->h.len; > > - if (task_active_pid_ns(current) != fc->pid_ns) { > - rcu_read_lock(); > - in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); > - rcu_read_unlock(); > - } > - > /* If request is too large, reply with an error and restart the read */ > if (nbytes < reqsize) { > req->out.h.error = -EIO; > -- > 2.14.1 > From mszeredi at redhat.com Thu Feb 22 10:26:22 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Thu, 22 Feb 2018 11:26:22 +0100 Subject: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids In-Reply-To: <20180221202908.17258-2-ebiederm@xmission.com> References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-2-ebiederm@xmission.com> Message-ID: On Wed, Feb 21, 2018 
at 9:29 PM, Eric W. Biederman wrote: > Upon a cursory examinination the uid and gid of a fuse request are > necessary for correct operation. Failing a fuse request where those > values are not reliable seems a straight forward and reliable means of > ensuring that fuse requests with bad data are not sent or processed. > > In most cases the vfs will avoid actions it suspects will cause > an inode write back of an inode with an invalid uid or gid. But that does > not map precisely to what fuse is doing, so test for this and solve > this at the fuse level as well. > > Performing this work in fuse_req_init_context is cheap as the code is > already performing the translation here and only needs to check the > result of the translation to see if things are not representable in > a form the fuse server can handle. > > Signed-off-by: Eric W. Biederman > --- > fs/fuse/dev.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index 0fb58f364fa6..216db3f51a31 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req) > refcount_dec(&req->count); > } > > -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) > +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) > { > - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); > - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); > + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); > + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); > req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); > + > + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); > } > > void fuse_set_initialized(struct fuse_conn *fc) > @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, > wake_up(&fc->blocked_waitq); > goto out; > } > - > - 
fuse_req_init_context(fc, req); > __set_bit(FR_WAITING, &req->flags); > if (for_background) > __set_bit(FR_BACKGROUND, &req->flags); > - > + if (unlikely(!fuse_req_init_context(fc, req))) { > + fuse_put_request(fc, req); > + return ERR_PTR(-EOVERFLOW); > + } > return req; > > out: > @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc, > if (!req) > req = get_reserved_req(fc, file); > > - fuse_req_init_context(fc, req); > __set_bit(FR_WAITING, &req->flags); > __clear_bit(FR_BACKGROUND, &req->flags); > + if (unlikely(!fuse_req_init_context(fc, req))) { > + fuse_put_request(fc, req); > + return ERR_PTR(-EOVERFLOW); > + } I think failing the "_nofail" variant is the wrong thing to do. This is called to allocate a FLUSH request on close() and in readdirplus to allocate a FORGET request. Failing the latter results in refcount leak in userspace. Failing the former results in missing unlock on close() of posix locks. Thanks, Miklos From mszeredi at redhat.com Thu Feb 22 11:40:18 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Thu, 22 Feb 2018 12:40:18 +0100 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: <20180221202908.17258-4-ebiederm@xmission.com> References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> Message-ID: On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman wrote: > Ensure the translation happens by failing to read or write > posix acls when the filesystem has not indicated it supports > posix acls. For the first iteration this is fine, but we could convert the raw xattrs as well, if we later want to, right? Thanks, Miklos > > This ensures that modern cached posix acl support is available > and used when dealing with posix acls. This is important > because only that path has the code to convernt the uids and > gids in posix acls into the user namespace of a fuse filesystem. > > Signed-off-by: "Eric W. 
Biederman" > --- > fs/fuse/fuse_i.h | 1 + > fs/fuse/inode.c | 7 +++++++ > fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 51 insertions(+) > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 7772e2b4057e..986fa2b043ab 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); > int fuse_removexattr(struct inode *inode, const char *name); > extern const struct xattr_handler *fuse_xattr_handlers[]; > extern const struct xattr_handler *fuse_acl_xattr_handlers[]; > +extern const struct xattr_handler *fuse_no_acl_xattr_handlers[]; > > struct posix_acl; > struct posix_acl *fuse_get_acl(struct inode *inode, int type); > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index e018dc3999f4..a52cf2019a58 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) > file->f_cred->user_ns != sb->s_user_ns) > goto err_fput; > > + /* > + * If we are not in the initial user namespace posix > + * acls must be translated. 
> + */ > + if (sb->s_user_ns != &init_user_ns) > + sb->s_xattr = fuse_no_acl_xattr_handlers; > + > fc = kmalloc(sizeof(*fc), GFP_KERNEL); > err = -ENOMEM; > if (!fc) > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c > index 3caac46b08b0..433717640f78 100644 > --- a/fs/fuse/xattr.c > +++ b/fs/fuse/xattr.c > @@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler, > return fuse_setxattr(inode, name, value, size, flags); > } > > +static bool no_xattr_list(struct dentry *dentry) > +{ > + return false; > +} > + > +static int no_xattr_get(const struct xattr_handler *handler, > + struct dentry *dentry, struct inode *inode, > + const char *name, void *value, size_t size) > +{ > + return -EOPNOTSUPP; > +} > + > +static int no_xattr_set(const struct xattr_handler *handler, > + struct dentry *dentry, struct inode *nodee, > + const char *name, const void *value, > + size_t size, int flags) > +{ > + return -EOPNOTSUPP; > +} > + > static const struct xattr_handler fuse_xattr_handler = { > .prefix = "", > .get = fuse_xattr_get, > @@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = { > &fuse_xattr_handler, > NULL > }; > + > +static const struct xattr_handler fuse_no_acl_access_xattr_handler = { > + .name = XATTR_NAME_POSIX_ACL_ACCESS, > + .flags = ACL_TYPE_ACCESS, > + .list = no_xattr_list, > + .get = no_xattr_get, > + .set = no_xattr_set, > +}; > + > +static const struct xattr_handler fuse_no_acl_default_xattr_handler = { > + .name = XATTR_NAME_POSIX_ACL_DEFAULT, > + .flags = ACL_TYPE_ACCESS, > + .list = no_xattr_list, > + .get = no_xattr_get, > + .set = no_xattr_set, > +}; > + > +const struct xattr_handler *fuse_no_acl_xattr_handlers[] = { > + &fuse_no_acl_access_xattr_handler, > + &fuse_no_acl_default_xattr_handler, > + &fuse_xattr_handler, > + NULL > +}; > -- > 2.14.1 > From ebiederm at xmission.com Thu Feb 22 18:15:00 2018 From: ebiederm at xmission.com (Eric W. 
Biederman) Date: Thu, 22 Feb 2018 12:15:00 -0600 Subject: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids In-Reply-To: (Miklos Szeredi's message of "Thu, 22 Feb 2018 11:26:22 +0100") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-2-ebiederm@xmission.com> Message-ID: <87eflc99i3.fsf@xmission.com> Miklos Szeredi writes: > On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman > wrote: >> Upon a cursory examinination the uid and gid of a fuse request are >> necessary for correct operation. Failing a fuse request where those >> values are not reliable seems a straight forward and reliable means of >> ensuring that fuse requests with bad data are not sent or processed. >> >> In most cases the vfs will avoid actions it suspects will cause >> an inode write back of an inode with an invalid uid or gid. But that does >> not map precisely to what fuse is doing, so test for this and solve >> this at the fuse level as well. >> >> Performing this work in fuse_req_init_context is cheap as the code is >> already performing the translation here and only needs to check the >> result of the translation to see if things are not representable in >> a form the fuse server can handle. >> >> Signed-off-by: Eric W. 
Biederman >> --- >> fs/fuse/dev.c | 20 +++++++++++++------- >> 1 file changed, 13 insertions(+), 7 deletions(-) >> >> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c >> index 0fb58f364fa6..216db3f51a31 100644 >> --- a/fs/fuse/dev.c >> +++ b/fs/fuse/dev.c >> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req) >> refcount_dec(&req->count); >> } >> >> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) >> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) >> { >> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); >> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); >> + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); >> + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); >> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); >> + >> + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); >> } >> >> void fuse_set_initialized(struct fuse_conn *fc) >> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, >> wake_up(&fc->blocked_waitq); >> goto out; >> } >> - >> - fuse_req_init_context(fc, req); >> __set_bit(FR_WAITING, &req->flags); >> if (for_background) >> __set_bit(FR_BACKGROUND, &req->flags); >> - >> + if (unlikely(!fuse_req_init_context(fc, req))) { >> + fuse_put_request(fc, req); >> + return ERR_PTR(-EOVERFLOW); >> + } >> return req; >> >> out: >> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc, >> if (!req) >> req = get_reserved_req(fc, file); >> >> - fuse_req_init_context(fc, req); >> __set_bit(FR_WAITING, &req->flags); >> __clear_bit(FR_BACKGROUND, &req->flags); >> + if (unlikely(!fuse_req_init_context(fc, req))) { >> + fuse_put_request(fc, req); >> + return ERR_PTR(-EOVERFLOW); >> + } > > I think failing the "_nofail" variant is the wrong thing to do. 
This > is called to allocate a FLUSH request on close() and in readdirplus to > allocate a FORGET request. Failing the latter results in refcount > leak in userspace. Failing the former results in missing unlock on > close() of posix locks. Doh! You are quite correct. Modifying fuse_get_req_nofail_nopages to fail is a bug. I am thinking the proper solution is to write: static void fuse_req_init_context_nofail(struct fuse_req *req) { req->in.h.uid = 0; req->in.h.gid = 0; req->in.h.pid = 0; } And use that in the nofail case. As it appears neither flush nor the eviction of inodes is a user space triggered action and as such user space identifiers are nonsense in those cases. I will respin this patch shortly. Eric From ebiederm at xmission.com Thu Feb 22 19:04:32 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 22 Feb 2018 13:04:32 -0600 Subject: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read In-Reply-To: (Miklos Szeredi's message of "Thu, 22 Feb 2018 11:13:36 +0100") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-1-ebiederm@xmission.com> Message-ID: <87fu5s7sn3.fsf@xmission.com> Miklos Szeredi writes: > On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman > wrote: >> At the point of fuse_dev_do_read the user space process that initiated the >> action on the fuse filesystem may no longer exist. The process have been >> killed or may have fired an asynchronous request and exited. >> >> If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid, >> fc->pid_ns)" will either return a pid of 0, or in the unlikely event that >> the pid has been reallocated it can return practically any pid. Any pid is >> possible as the pid allocator allocates pid numbers in different pid >> namespaces independently. >> >> The only way to make translation in fuse_dev_do_read reliable is to call >> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in >> fuse_dev_do_read. 
That reference counting in other contexts has been shown >> to bounce cache lines between processors and in general be slow. So that is >> not desirable. >> >> The only known user of running the fuse server in a different pid namespace >> from the filesystem does not care what the pids are in the fuse messages >> so removing this code should not matter. > > Shouldn't we at least zero out the pid in that case? This is an explicit case of passing a file descriptor between pid namespaces. So I think there are plenty of buyer-beware signs out. So I don't know if there are any real-world advantages to zeroing the pid. I can see a case for using the pid namespace of the opener of /dev/fuse instead of the pid namespace of the mounter of the fuse filesystem. Although in practice I would be surprised if they were different. I am very leery about caring during a read operation. Caring about the current process during read/write tends to break caching, is error-prone as the need for this patch demonstrates, and is generally likely to be slower than not caring. So yes we can zero the pid. I don't think it is wise to zero the pid unless we zero the pid in fuse_req_init_context. Eric From ebiederm at xmission.com Thu Feb 22 19:18:33 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 22 Feb 2018 13:18:33 -0600 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: (Miklos Szeredi's message of "Thu, 22 Feb 2018 12:40:18 +0100") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> Message-ID: <87inao6dfa.fsf@xmission.com> Miklos Szeredi writes: > On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman > wrote: >> Ensure the translation happens by failing to read or write >> posix acls when the filesystem has not indicated it supports >> posix acls. > > For the first iteration this is fine, but we could convert the raw > xattrs as well, if we later want to, right? I will say maybe.
This is tricky. The code would not be too hard, and the function to do the work posix_acl_fix_xattr_userns already exists in fs/posix_acl.c I don't actually expect that to work longterm. I expect the direction the kernel internals are moving is that all filesystems that implement posix acls will be expected to implement .get_acl and .set_acl. I would have to reread the old thread that got us to this point with posix acls before I could really understand the backwards compatible fuse use case, and I would have to reread the rest of the acl processing in the kernel before I could recall exactly what makes sense. If there was an obvious way to whitelist xattrs that fuse can support for user namespaces I think I would go for that. Just to avoid future problems with future xattrs. Eric From ebiederm at xmission.com Thu Feb 22 22:50:58 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Thu, 22 Feb 2018 16:50:58 -0600 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: <87inao6dfa.fsf@xmission.com> (Eric W. Biederman's message of "Thu, 22 Feb 2018 13:18:33 -0600") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> <87inao6dfa.fsf@xmission.com> Message-ID: <87mv004p0t.fsf@xmission.com> ebiederm at xmission.com (Eric W. Biederman) writes: > Miklos Szeredi writes: > >> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman >> wrote: >>> Ensure the translation happens by failing to read or write >>> posix acls when the filesystem has not indicated it supports >>> posix acls. >> >> For the first iteration this is fine, but we could convert the raw >> xattrs as well, if we later want to, right? > > I will say maybe. This is tricky. The code would not be too hard, > and the function to do the work posix_acl_fix_xattr_userns already > exists in fs/posix_acl.c > > I don't actually expect that to work longterm. 
I expect the direction > the kernel internals are moving is that all filesystems that implement > posix acls will be expected to implement .get_acl and .set_acl. > > I would have to reread the old thread that got us to this point with > posix acls before I could really understand the backwards compatible > fuse use case, and I would have to reread the rest of the acl processing > in the kernel before I could recall exactly what makes sense. > > If there was an obvious way to whitelist xattrs that fuse can support > for user namespaces I think I would go for that. Just to avoid future > problems with future xattrs. I am remembering why this is such a sticky issue. Today when a posix acl is read from user space the code does: posix_acl_to_xattr(&init_user_ns, ...) in posix_acl_xattr_get posix_acl_fix_xattr_to_user() in getxattr Similary when a posix acl is written from user space the code does: posix_acl_fix_xattr_from_user() in setxattr posix_acl_from_xattr(&init_user_us, ...) in posix_acl_xattr_set If every posix acl supporting filesystem in the kernel would use posix_acl_access_xattr_handler and posix_acl_default_xattr_handler the function posix_acl_fix_xattr_to_user and posix_acl_fix_xattr_from_user and posix_acl_fix_xattr_userns could all be removed and the posix acl handling could be that little bit simpler and faster. So if we could figure out how to use the generic acl support for the old brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much easier to support them long term. Eric
From sargun at sargun.me Mon Feb 26 07:26:54 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 07:26:54 +0000 Subject: [net-next v3 0/2] eBPF seccomp filters Message-ID: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> This patchset enables seccomp filters to be written in eBPF. Although, this patchset doesn't introduce much of the functionality enabled by eBPF, it lays the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE support in order to utilize eBPF seccomp filters, as eBPF filters cannot be retrieved via the ptrace GET_FILTER API. Any user can load a bpf seccomp filter program, and it can be pinned and reused without requiring access to the bpf syscalls. A user only requires the traditional permissions of either being cap_sys_admin, or have no_new_privs set in order to install their rule.
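The permission requirement described above (CAP_SYS_ADMIN or no_new_privs) can be met by an unprivileged process through prctl(2). A minimal sketch, assuming a Linux kernel that supports PR_SET_NO_NEW_PRIVS (3.5 or later); the helper names enable_no_new_privs and no_new_privs_is_set are made up for illustration and are not part of this patchset:

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper (not from the patchset): set no_new_privs on the
 * calling thread so that it may install a seccomp filter without
 * CAP_SYS_ADMIN. Returns 0 on success, -errno on failure.
 */
static int enable_no_new_privs(void)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: report the current no_new_privs state.
 * Returns 1 if set, 0 if not set, -errno on failure.
 */
static int no_new_privs_is_set(void)
{
	int ret = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);

	return ret < 0 ? -errno : ret;
}
```

Note that no_new_privs cannot be cleared once set and is inherited across fork and execve, so a caller would typically set it immediately before installing the filter.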
The primary reason for not adding maps support in this patchset is to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. If we have a map that the BPF program can read, it can potentially "change" privileges after running. It seems that allowing only writes is safe, because the program can stay pure and side-effect free, and therefore does not negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come to an agreement, this can be in a follow-up patchset.

A benchmark of this patchset is as follows for a very standard eBPF filter. Given this test program:

    for (i = 10; i < 99999999; i++) syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs with a program per syscall, and tail call, the numbers are such:

    ebpf JIT          12.3% slower than native
    ebpf no JIT       13.6% slower than native
    seccomp JIT       17.6% slower than native
    seccomp no JIT    37% slower than native

The cost of the traditional seccomp filter grows O(n) with the number of syscalls with discrete rulesets, whereas ebpf is O(1), given any number of syscall filters.
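The test loop above can be wrapped in a small timing harness. A rough sketch, assuming Linux with clock_gettime(CLOCK_MONOTONIC); the function name time_getpid_loop and the caller-chosen iteration count are illustrative choices, not the harness that produced the numbers above:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Hypothetical harness: issue `iterations` raw getpid syscalls and
 * return the elapsed monotonic time in nanoseconds, or -1 if the
 * clock could not be read.
 */
static long long time_getpid_loop(long iterations)
{
	struct timespec start, end;
	long i;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;

	/* Use syscall() directly so every iteration enters the kernel. */
	for (i = 0; i < iterations; i++)
		syscall(__NR_getpid);

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;

	return (end.tv_sec - start.tv_sec) * 1000000000LL +
	       (end.tv_nsec - start.tv_nsec);
}
```

Calling it once with no filter installed and once after installing each kind of filter would give the relative slowdown; a single run is noisy, so repeated runs are probably needed for figures comparable to those quoted.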
Changes since v2: * Rename sample * Code cleanup Changes since v1: * Use a flag to indicate loading an eBPF filter, not a separate command * Remove printk helper * Remove ptrace patch / restore filter / sample * Add some safe helpers Sargun Dhillon (2): bpf, seccomp: Add eBPF filter capabilities bpf: Add eBPF seccomp sample programs arch/Kconfig | 8 ++ include/linux/bpf_types.h | 3 + include/linux/seccomp.h | 3 +- include/uapi/linux/bpf.h | 2 + include/uapi/linux/seccomp.h | 7 +- kernel/bpf/syscall.c | 1 + kernel/seccomp.c | 159 ++++++++++++++++++++++++++++++++++------ samples/bpf/Makefile | 5 ++ samples/bpf/bpf_load.c | 9 ++- samples/bpf/test_seccomp_kern.c | 41 +++++++++++ samples/bpf/test_seccomp_user.c | 46 ++++++++++++ 11 files changed, 255 insertions(+), 29 deletions(-) create mode 100644 samples/bpf/test_seccomp_kern.c create mode 100644 samples/bpf/test_seccomp_user.c -- 2.14.1 From sargun at sargun.me Mon Feb 26 07:27:05 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 07:27:05 +0000 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities Message-ID: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant to be used for seccomp filters as an alternative to cBPF filters. The program type has relatively limited capabilities in terms of helpers, but that can be extended later on. The eBPF code loading is separated from attachment of the filter, so a privileged user can load the filter, and pass it back to an unprivileged user who can attach it and use it at a later time. In order to attach the filter itself, you need to supply a flag to the seccomp syscall indicating that an eBPF filter is being attached, as opposed to a cBPF one. Verification occurs at program load time, so the user should only receive errors related to attachment.
Signed-off-by: Sargun Dhillon --- arch/Kconfig | 8 +++ include/linux/bpf_types.h | 3 + include/linux/seccomp.h | 3 +- include/uapi/linux/bpf.h | 2 + include/uapi/linux/seccomp.h | 7 +- kernel/bpf/syscall.c | 1 + kernel/seccomp.c | 159 ++++++++++++++++++++++++++++++++++++------- 7 files changed, 156 insertions(+), 27 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 76c0b54443b1..8490d35e59d6 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -401,6 +401,14 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_FILTER_EXTENDED + bool "Extended BPF seccomp filters" + depends on SECCOMP_FILTER && BPF_SYSCALL + depends on !CHECKPOINT_RESTORE + help + Enables seccomp filters to be written in eBPF, as opposed + to just cBPF filters. + config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 19b8349a3809..945c65c4e461 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event) #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) #endif +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index c723a5c4e3ff..a7df3ba6cf25 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -5,7 +5,8 @@ #include #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG) + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_EXTENDED) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index db6bdc375126..5f96cb7ed954 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1,3 +1,4 @@ + /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ /* Copyright (c) 2011-2014 PLUMgrid, 
http://plumgrid.com * @@ -133,6 +134,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_CGROUP_DEVICE, + BPF_PROG_TYPE_SECCOMP, }; enum bpf_attach_type { diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..730af6c7ec2e 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -16,10 +16,11 @@ #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 -/* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC 1 -#define SECCOMP_FILTER_FLAG_LOG 2 +/* Valid flags for SECCOMP_SET_MODE_FILTER */ +#define SECCOMP_FILTER_FLAG_TSYNC (1 << 0) +#define SECCOMP_FILTER_FLAG_LOG (1 << 1) +#define SECCOMP_FILTER_FLAG_EXTENDED (1 << 2) /* * All BPF programs must return a 32-bit value. * The bottom 16-bits are for optional return data. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e24aa3241387..86d6ec8b916d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1202,6 +1202,7 @@ static int bpf_prog_load(union bpf_attr *attr) if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && + type != BPF_PROG_TYPE_SECCOMP && !capable(CAP_SYS_ADMIN)) return -EPERM; diff --git a/kernel/seccomp.c b/kernel/seccomp.c index dc77548167ef..d95c24181a6c 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -37,6 +37,7 @@ #include #include #include +#include /** * struct seccomp_filter - container for seccomp BPF programs @@ -367,17 +368,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter)); - /* - * Installing a seccomp filter requires that the task has - * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. - * This avoids scenarios where unprivileged tasks can affect the - * behavior of privileged children. 
- */ - if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) - return ERR_PTR(-EACCES); - /* Allocate a new seccomp_filter */ sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); if (!sfilter) @@ -423,6 +413,48 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +/** + * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd + * @user_filter: pointer to the user data containing an fd. + * + * Returns 0 on success and non-zero otherwise. + */ +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *user_fd) +{ + struct seccomp_filter *sfilter; + struct bpf_prog *fp; + int fd; + + /* Fetch the fd from userspace */ + if (get_user(fd, (int __user *)user_fd)) + return ERR_PTR(-EFAULT); + + /* Allocate a new seccomp_filter */ + sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN); + if (!sfilter) + return ERR_PTR(-ENOMEM); + + fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP); + if (IS_ERR(fp)) { + kfree(sfilter); + return ERR_CAST(fp); + } + + sfilter->prog = fp; + refcount_set(&sfilter->usage, 1); + + return sfilter; +} +#else +static struct seccomp_filter * +seccomp_prepare_extended_filter(const char __user *filter_fd) +{ + return ERR_PTR(-EINVAL); +} +#endif + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -492,7 +524,10 @@ void get_seccomp_filter(struct task_struct *tsk) static inline void seccomp_filter_free(struct seccomp_filter *filter) { if (filter) { - bpf_prog_destroy(filter->prog); + if (bpf_prog_was_classic(filter->prog)) + bpf_prog_destroy(filter->prog); + else + bpf_prog_put(filter->prog); kfree(filter); } } @@ -844,7 +879,8 @@ static long seccomp_set_mode_strict(void) static long seccomp_set_mode_filter(unsigned int flags, const char __user *filter) { - const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; + /* 
We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */ + const unsigned long filter_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; @@ -853,10 +889,31 @@ static long seccomp_set_mode_filter(unsigned int flags, return -EINVAL; /* Prepare the new filter before holding any locks. */ - prepared = seccomp_prepare_user_filter(filter); + if (flags & SECCOMP_FILTER_FLAG_EXTENDED) + prepared = seccomp_prepare_extended_filter(filter); + else + prepared = seccomp_prepare_user_filter(filter); + if (IS_ERR(prepared)) return PTR_ERR(prepared); + /* + * Installing a seccomp filter requires that the task has + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. + * This avoids scenarios where unprivileged tasks can affect the + * behavior of privileged children. + * + * This is checked after filter preparation because the user + * will get an EINVAL if their filter is invalid prior to the + * EPERM. + */ + if (!task_no_new_privs(current) && + security_capable_noaudit(current_cred(), current_user_ns(), + CAP_SYS_ADMIN) != 0) { + ret = -EACCES; + goto out_free; + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_lock_irq(&current->sighand->siglock); - if (!seccomp_may_assign_mode(seccomp_mode)) + if (!seccomp_may_assign_mode(filter_mode)) goto out; ret = seccomp_attach_filter(flags, prepared); @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags, /* Do not free the successfully attached filter.
*/ prepared = NULL; - seccomp_assign_mode(current, seccomp_mode); + seccomp_assign_mode(current, filter_mode); out: spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) @@ -1040,15 +1097,16 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, if (IS_ERR(filter)) return PTR_ERR(filter); + /* This must be a new non-cBPF filter, since we save + * every cBPF filter's orig_prog above when + * CONFIG_CHECKPOINT_RESTORE is enabled. + */ + ret = -EMEDIUMTYPE; fprog = filter->prog->orig_prog; - if (!fprog) { - /* This must be a new non-cBPF filter, since we save - * every cBPF filter's orig_prog above when - * CONFIG_CHECKPOINT_RESTORE is enabled. - */ - ret = -EMEDIUMTYPE; + if (!fprog) + goto out; + if (!bpf_prog_was_classic(filter->prog)) goto out; - } ret = fprog->len; if (!data) @@ -1241,6 +1299,61 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write, return 0; } +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED +static bool seccomp_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (type != BPF_READ) + return false; + + if (off < 0 || off + size > sizeof(struct seccomp_data)) + return false; + + if (off % size != 0) + return false; + + switch (off) { + case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]): + return (size == sizeof(__u64)); + case bpf_ctx_range(struct seccomp_data, nr): + return (size == FIELD_SIZEOF(struct seccomp_data, nr)); + case bpf_ctx_range(struct seccomp_data, arch): + return (size == FIELD_SIZEOF(struct seccomp_data, arch)); + case bpf_ctx_range(struct seccomp_data, instruction_pointer): + return (size == FIELD_SIZEOF(struct seccomp_data, + instruction_pointer)); + default: + return false; + } +} + +static const struct bpf_func_proto * +seccomp_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_get_current_uid_gid: + return &bpf_get_current_uid_gid_proto; + case BPF_FUNC_ktime_get_ns:
+ return &bpf_ktime_get_ns_proto; + case BPF_FUNC_get_prandom_u32: + return &bpf_get_prandom_u32_proto; + case BPF_FUNC_get_current_pid_tgid: + return &bpf_get_current_pid_tgid_proto; + default: + return NULL; + } +} + +const struct bpf_prog_ops seccomp_prog_ops = { +}; + +const struct bpf_verifier_ops seccomp_verifier_ops = { + .get_func_proto = seccomp_func_proto, + .is_valid_access = seccomp_is_valid_access, +}; +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */ + static struct ctl_path seccomp_sysctl_path[] = { { .procname = "kernel", }, { .procname = "seccomp", }, -- 2.14.1 From sargun at sargun.me Mon Feb 26 07:27:19 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 07:27:19 +0000 Subject: [net-next v3 2/2] bpf: Add eBPF seccomp sample programs Message-ID: <20180226072716.GA27069@ircssh-2.c.rugged-nimbus-611.internal> This adds a sample program that uses seccomp-eBPF, called seccomp1. It shows the simple ability to code seccomp filters in C. Signed-off-by: Sargun Dhillon --- samples/bpf/Makefile | 5 +++++ samples/bpf/bpf_load.c | 9 ++++++-- samples/bpf/test_seccomp_kern.c | 41 ++++++++++++++++++++++++++++++++++++ samples/bpf/test_seccomp_user.c | 46 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 99 insertions(+), 2 deletions(-) create mode 100644 samples/bpf/test_seccomp_kern.c create mode 100644 samples/bpf/test_seccomp_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ec3fc8d88e87..05f21988775f 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -43,6 +43,7 @@ hostprogs-y += xdp_redirect_cpu hostprogs-y += xdp_monitor hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp +hostprogs-y += test_seccomp # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o @@ -93,6 +94,8 @@ xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o 
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o +test_seccomp-objs := bpf_load.o $(LIBBPF) test_seccomp_user.o + # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -144,6 +147,7 @@ always += xdp_monitor_kern.o always += xdp_rxq_info_kern.o always += xdp2skb_meta_kern.o always += syscall_tp_kern.o +always += test_seccomp_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -188,6 +192,7 @@ HOSTLOADLIBES_xdp_redirect_cpu += -lelf HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_xdp_rxq_info += -lelf HOSTLOADLIBES_syscall_tp += -lelf +HOSTLOADLIBES_test_seccomp += -lelf # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 69806d74fa53..856bc8b93916 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -67,6 +67,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0; bool is_sockops = strncmp(event, "sockops", 7) == 0; bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0; + bool is_seccomp = strncmp(event, "seccomp", 7) == 0; size_t insns_cnt = size / sizeof(struct bpf_insn); enum bpf_prog_type prog_type; char buf[256]; @@ -96,6 +97,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_type = BPF_PROG_TYPE_SOCK_OPS; } else if (is_sk_skb) { prog_type = BPF_PROG_TYPE_SK_SKB; + } else if (is_seccomp) { + prog_type = BPF_PROG_TYPE_SECCOMP; } else { printf("Unknown event '%s'\n", event); return -1; @@ -110,7 +113,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) prog_fd[prog_cnt++] = fd; - if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk) + if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk || + is_seccomp) return 0; if (is_socket || is_sockops || 
is_sk_skb) { @@ -589,7 +593,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map) memcmp(shname, "socket", 6) == 0 || memcmp(shname, "cgroup/", 7) == 0 || memcmp(shname, "sockops", 7) == 0 || - memcmp(shname, "sk_skb", 6) == 0) { + memcmp(shname, "sk_skb", 6) == 0 || + memcmp(shname, "seccomp", 7) == 0) { ret = load_and_attach(shname, data->d_buf, data->d_size); if (ret != 0) diff --git a/samples/bpf/test_seccomp_kern.c b/samples/bpf/test_seccomp_kern.c new file mode 100644 index 000000000000..a0dd39b4ba16 --- /dev/null +++ b/samples/bpf/test_seccomp_kern.c @@ -0,0 +1,41 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include "bpf_helpers.h" +#include +#include + +#if defined(__x86_64__) +#define ARCH AUDIT_ARCH_X86_64 +#elif defined(__i386__) +#define ARCH AUDIT_ARCH_I386 +#else +#endif + +#ifdef ARCH +/* Returns EPERM when trying to close fd 999 */ +SEC("seccomp") +int bpf_prog1(struct seccomp_data *ctx) +{ + /* + * Make sure this BPF program is being run on the same architecture it + * was compiled on. 
+ */ + if (ctx->arch != ARCH) + return SECCOMP_RET_ERRNO | EPERM; + if (ctx->nr == __NR_close && ctx->args[0] == 999) + return SECCOMP_RET_ERRNO | EPERM; + + return SECCOMP_RET_ALLOW; +} +#else +#warning Architecture not supported -- Blocking all syscalls +SEC("seccomp") +int bpf_prog1(struct seccomp_data *ctx) +{ + return SECCOMP_RET_ERRNO | EPERM; +} +#endif + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/test_seccomp_user.c b/samples/bpf/test_seccomp_user.c new file mode 100644 index 000000000000..225db14217a2 --- /dev/null +++ b/samples/bpf/test_seccomp_user.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include "libbpf.h" +#include "bpf_load.h" +#include +#include +#include +#include +#include +#include + +int main(int argc, char **argv) +{ + char filename[256]; + + + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(filename)) { + printf("%s", bpf_log_buf); + return 1; + } + + /* set new_new_privs so non-privileged users can attach filters */ + if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) { + perror("prctl(NO_NEW_PRIVS)"); + return 1; + } + + if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, + SECCOMP_FILTER_FLAG_EXTENDED, &prog_fd)) { + perror("seccomp"); + return 1; + } + + close(111); + assert(errno == EBADF); + close(999); + assert(errno == EPERM); + + printf("close syscall successfully filtered\n"); + return 0; +} -- 2.14.1 From mszeredi at redhat.com Mon Feb 26 07:47:21 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Mon, 26 Feb 2018 08:47:21 +0100 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: <87mv004p0t.fsf@xmission.com> References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> <87inao6dfa.fsf@xmission.com> <87mv004p0t.fsf@xmission.com> Message-ID: On Thu, Feb 22, 2018 at 11:50 PM, Eric W. 
Biederman wrote: > So if we could figure out how to use the generic acl support for the old > brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much > easier to support them long term. Simplest and most robust way seems to be to do everything the same (as with FUSE_POSIX_ACL) but tell the vfs not to cache the acl. Thanks, Miklos From ebiederm at xmission.com Mon Feb 26 16:35:17 2018 From: ebiederm at xmission.com (Eric W.
Biederman) Date: Mon, 26 Feb 2018 10:35:17 -0600 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: (Miklos Szeredi's message of "Mon, 26 Feb 2018 08:47:21 +0100") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> <87inao6dfa.fsf@xmission.com> <87mv004p0t.fsf@xmission.com> Message-ID: <87zi3v1zga.fsf@xmission.com> Miklos Szeredi writes: > On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman > wrote: > >> So if we could figure out how to use the generic acl support for the old >> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much >> easier to support them long term. > > Simplest and most robust way seems to be to do everything the same (as > with FUSE_POSIX_ACL) but tell the vfs not to cache the acl. Good point. That sounds like for the !fc->posix_acl case we just need a careful use of "forget_all_cached_acls(inode)". I will take a quick look at that, and see if that is easy/sufficient to cover the legacy fuse case. Otherwise I will go with what I already have here. That feels like a better path. And internally I would call what is today fc->posix_acl fc->cached_posix_acl. To better convey the intent. Fingers crossed. Eric
From ebiederm at xmission.com Mon Feb 26 21:51:16 2018 From: ebiederm at xmission.com (Eric W.
Biederman) Date: Mon, 26 Feb 2018 15:51:16 -0600 Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns In-Reply-To: <87zi3v1zga.fsf@xmission.com> (Eric W. Biederman's message of "Mon, 26 Feb 2018 10:35:17 -0600") References: <878tbmf5vl.fsf@xmission.com> <20180221202908.17258-4-ebiederm@xmission.com> <87inao6dfa.fsf@xmission.com> <87mv004p0t.fsf@xmission.com> <87zi3v1zga.fsf@xmission.com> Message-ID: <87lgff1ktn.fsf@xmission.com> ebiederm at xmission.com (Eric W. Biederman) writes: > Miklos Szeredi writes: > >> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman >> wrote: >> >>> So if we could figure out how to use the generic acl support for the old >>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much >>> easier to support them long term. >> >> Simplest and most robust way seems to be to do everything the same (as >> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl. > > Good point. That sounds like for the !fc->posix_acl case we just > need a careful use of "forget_all_cached_acls(inode)". > > I will take a quick look at that, and see if that is easy/sufficient to > cover the legacy fuse case. Otherwise I will go with what I already > have here. > > That feels like a better path. And internally I would call what is > today fc->posix_acl fc->cached_posix_acl. To better convey the intent. > Fingers crossed. It looks like simply setting "inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;" is the secret sauce needed to disable caching in the legacy case and make everything work. I had to tweak the calls to forget_all_cached_acls so that won't clear the ACL_DONT_CACHE status but otherwise that was an absolutely trivial change to combine those two code paths. I will post my updated patches shortly. 
Eric From alexei.starovoitov at gmail.com Mon Feb 26 23:04:20 2018 From: alexei.starovoitov at gmail.com (Alexei Starovoitov) Date: Mon, 26 Feb 2018 15:04:20 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: <20180226230418.46nczgkh5csakyu7@ast-mbp> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: > This patchset enables seccomp filters to be written in eBPF. Although, this > patchset doesn't introduce much of the functionality enabled by eBPF, it lays > the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE > support in order to utilize eBPF seccomp filters, as eBPF filters cannot be > retrieved via the ptrace GET_FILTER API. this was discussed multiple times in the past. In eBPF land it's practically impossible to do checkpoint/restore of the whole bpf program/map graph. > Any user can load a bpf seccomp filter program, and it can be pinned and > reused without requiring access to the bpf syscalls. A user only requires > the traditional permissions of either being cap_sys_admin, or have > no_new_privs set in order to install their rule. > > The primary reason for not adding maps support in this patchset is > to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. > If we have a map that the BPF program can read, it can potentially > "change" privileges after running. It seems like doing writes only > is safe, because it can be pure, and side effect free, and therefore > not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come > to an agreement, this can be in a follow-up patchset. readonly maps already exist. See BPF_F_RDONLY. Is that not enough? 
> A benchmark of this patchset is as follows for a very standard eBPF filter: > > Given this test program: > for (i = 10; i < 99999999; i++) syscall(__NR_getpid); > > If I implement an eBPF filter with PROG_ARRAYs with a program per syscall, > and tail call, the numbers are such: > ebpf JIT 12.3% slower than native > ebpf no JIT 13.6% slower than native > seccomp JIT 17.6% slower than native > seccomp no JIT 37% slower than native the perf gains are misleading, since patches don't enable bpf_tail_call. The main statement I want to hear from seccomp maintainers before proceeding any further on this that enabling eBPF in seccomp won't lead to seccomp folks arguing against changes in bpf core (like verifier) just because it's used by seccomp. It must be spelled out in the commit log with explicit Ack. From keescook at chromium.org Mon Feb 26 23:20:15 2018 From: keescook at chromium.org (Kees Cook) Date: Mon, 26 Feb 2018 15:20:15 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <20180226230418.46nczgkh5csakyu7@ast-mbp> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov wrote: > On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >> This patchset enables seccomp filters to be written in eBPF. Although, this >> [...] > The main statement I want to hear from seccomp maintainers before > proceeding any further on this that enabling eBPF in seccomp won't lead > to seccomp folks arguing against changes in bpf core (like verifier) > just because it's used by seccomp. > It must be spelled out in the commit log with explicit Ack. The primary thing I'm concerned about with eBPF and seccomp is side-effects from eBPF programs running at syscall time. This is an extremely sensitive area, and I want to be sure there won't be feature-creep here that leads to seccomp getting into a bad state. 
As long as seccomp can continue to have its own verifier, I *think* this will be fine, though, again I remain concerned about maps, etc. I'm still reviewing these patches and how they might provide overlap with Tycho's needs too, etc. -Kees -- Kees Cook Pixel Security From ebiederm at xmission.com Mon Feb 26 23:52:21 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:52:21 -0600 Subject: [PATCH v7 0/7] fuse: mounts from non-init user namespaces In-Reply-To: <878tbmf5vl.fsf@xmission.com> (Eric W. Biederman's message of "Wed, 21 Feb 2018 14:24:30 -0600") References: <878tbmf5vl.fsf@xmission.com> Message-ID: <87po4rz4ui.fsf_-_@xmission.com> This patchset builds on the work by Donsu Park and Seth Forshee and is reduced to the set of patches that just affect fuse. The non-fuse patches are far enough along that we can ignore them except possibly for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type. Fuse with a block device has been left as an exercise for a later time. Since v5 I changed the core of this patchset around as the previous patches were showing signs of bitrot. Some important explanations were missing, some important functionality was missing, and xattr handling was completely absent. Since v6 I have: - Removed the failure case from fuse_get_req_nofail_nopages that I added. - Updated fuse to always use posix_acl_access_xattr_handler, and posix_acl_default_xattr_handler, by teaching fuse to set ACL_DONT_CACHE when FUSE_POSIX_ACL is not set. Miklos, can you take a look and see what you think? I think this much of the fuse changes is ready, and as such I would like to get them in this development cycle if possible. These changes are also available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v7 Eric W.
Biederman (6): fuse: Remove the buggy retranslation of pids in fuse_dev_do_read fuse: Fail all requests with invalid uids or gids fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS fuse: Simplify the posix acl handling logic. fuse: Support fuse filesystems outside of init_user_ns Seth Forshee (1): fuse: Restrict allow_other to the superblock's namespace or a descendant fs/fuse/acl.c | 10 +++++----- fs/fuse/cuse.c | 7 ++++++- fs/fuse/dev.c | 30 +++++++++++++++++------------- fs/fuse/dir.c | 27 +++++++++++++-------------- fs/fuse/fuse_i.h | 11 ++++++++--- fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++--------------- fs/fuse/xattr.c | 6 +----- fs/posix_acl.c | 7 +++++-- kernel/user_namespace.c | 1 + 9 files changed, 85 insertions(+), 58 deletions(-) Eric From ebiederm at xmission.com Mon Feb 26 23:52:56 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:52:56 -0600 Subject: [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-1-ebiederm@xmission.com> At the point of fuse_dev_do_read the user space process that initiated the action on the fuse filesystem may no longer exist. The process may have been killed or may have fired an asynchronous request and exited. If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns))" will either return a pid of 0, or in the unlikely event that the pid has been reallocated it can return practically any pid. Any pid is possible as the pid allocator allocates pid numbers in different pid namespaces independently. The only way to make translation in fuse_dev_do_read reliable is to call get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in fuse_dev_do_read.
That reference counting in other contexts has been shown to bounce cache lines between processors and in general be slow. So that is not desirable. The only known user of running the fuse server in a different pid namespace from the filesystem does not care what the pids are in the fuse messages, so removing this code should not matter. Getting the translation to a server running outside of the pid namespace of a container can still be achieved by playing setns games at mount time. It is also possible to add an option to pass a pid namespace into the fuse filesystem at mount time. Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns") Signed-off-by: "Eric W. Biederman" --- fs/fuse/dev.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 5d06384c2cae..0fb58f364fa6 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, in = &req->in; reqsize = in->h.len; - if (task_active_pid_ns(current) != fc->pid_ns) { - rcu_read_lock(); - in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns)); - rcu_read_unlock(); - } - /* If request is too large, reply with an error and restart the read */ if (nbytes < reqsize) { req->out.h.error = -EIO; -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:52:57 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:52:57 -0600 Subject: [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-2-ebiederm@xmission.com> Upon a cursory examination, the uid and gid of a fuse request are necessary for correct operation. Failing a fuse request where those values are not reliable seems a straightforward and reliable means of ensuring that fuse requests with bad data are not sent or processed.
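A minimal userspace sketch of the check described here, under stated assumptions: the `model_*` functions, `struct id_extent`, `INVALID_ID` and `OVERFLOW_ID` are hypothetical stand-ins for illustration, not kernel APIs. It contrasts a translation that reports an unmapped id as -1 (in the spirit of from_kuid and from_kgid) with one that substitutes the overflow id (in the spirit of from_kuid_munged), and refuses the request when either id cannot be represented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; none of these names exist in the kernel. */
#define INVALID_ID  ((uint32_t)-1) /* what an unmapped id translates to */
#define OVERFLOW_ID 65534u         /* default overflow uid/gid value */

struct id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first id as seen by the kernel */
	uint32_t count;       /* number of ids covered by this extent */
};

/* Strict translation: INVALID_ID when no extent covers the kernel id. */
static uint32_t model_from_kid(const struct id_extent *map, size_t n,
			       uint32_t kid)
{
	for (size_t i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	}
	return INVALID_ID;
}

/* "Munged" translation: hides the failure behind the overflow id. */
static uint32_t model_from_kid_munged(const struct id_extent *map, size_t n,
				      uint32_t kid)
{
	uint32_t id = model_from_kid(map, n, kid);

	return id == INVALID_ID ? OVERFLOW_ID : id;
}

struct model_req {
	uint32_t uid;
	uint32_t gid;
};

/* Fill in the request ids; false means the request should be failed. */
static bool model_req_init_context(struct model_req *req,
				   const struct id_extent *map, size_t n,
				   uint32_t kuid, uint32_t kgid)
{
	req->uid = model_from_kid(map, n, kuid);
	req->gid = model_from_kid(map, n, kgid);
	return req->uid != INVALID_ID && req->gid != INVALID_ID;
}
```

With a single extent mapping kernel ids 100000..165535 to namespace ids 0..65535, kernel id 101000 translates to 1000, while kernel id 1000 has no mapping: the strict variant reports that, and the munged variant would silently hand the server 65534 instead.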
In most cases the vfs will avoid actions it suspects will cause an inode write back of an inode with an invalid uid or gid. But that does not map precisely to what fuse is doing, so test for this and solve this at the fuse level as well. Performing this work in fuse_req_init_context is cheap as the code is already performing the translation here and only needs to check the result of the translation to see if things are not representable in a form the fuse server can handle. Signed-off-by: Eric W. Biederman --- fs/fuse/dev.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 0fb58f364fa6..2886a56d5f61 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req) refcount_dec(&req->count); } -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); + + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); +} + +static void fuse_req_init_context_nofail(struct fuse_req *req) +{ + req->in.h.uid = 0; + req->in.h.gid = 0; + req->in.h.pid = 0; } void fuse_set_initialized(struct fuse_conn *fc) @@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, wake_up(&fc->blocked_waitq); goto out; } - - fuse_req_init_context(fc, req); __set_bit(FR_WAITING, &req->flags); if (for_background) __set_bit(FR_BACKGROUND, &req->flags); - + if (unlikely(!fuse_req_init_context(fc, req))) { + fuse_put_request(fc, req); + return ERR_PTR(-EOVERFLOW); + } return req; out: @@ -256,7 +266,7 @@ 
struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc, if (!req) req = get_reserved_req(fc, file); - fuse_req_init_context(fc, req); + fuse_req_init_context_nofail(req); __set_bit(FR_WAITING, &req->flags); __clear_bit(FR_BACKGROUND, &req->flags); return req; -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:52:58 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:52:58 -0600 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-3-ebiederm@xmission.com> Fuse is about to join overlayfs in relying on get_acl respecting ACL_DONT_CACHE, so update the documentation in get_acl to reflect that fact. The comment and this change description should give people a clue that respecting ACL_DONT_CACHE in get_acl is important, and they should audit the filesystems before removing that support. Additionally, update the comment above the call to get_acl itself and remove the wrong information that an implementation of get_acl can prevent caching by calling forget_cached_acl. Replace that with the correct information that to prevent caching all that is necessary is to set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE when the inode is initialized. Signed-off-by: "Eric W. Biederman" --- fs/posix_acl.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/posix_acl.c b/fs/posix_acl.c index 2fd0fde16fe1..3c24fc263401 100644 --- a/fs/posix_acl.c +++ b/fs/posix_acl.c @@ -121,14 +121,17 @@ struct posix_acl *get_acl(struct inode *inode, int type) * could wait for that other task to complete its job, but it's easier * to just call ->get_acl to fetch the ACL ourself. (This is going to * be an unlikely race.) + * + * ACL_DONT_CACHE is treated as another task updating the acl and + * remains set.
*/ if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED) /* fall through */ ; /* * Normally, the ACL returned by ->get_acl will be cached. - * A filesystem can prevent that by calling - * forget_cached_acl(inode, type) in ->get_acl. + * A filesystem can prevent that by setting + * inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE. * * If the filesystem doesn't have a get_acl() function at all, we'll * just create the negative cache entry. -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:52:59 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:52:59 -0600 Subject: [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-4-ebiederm@xmission.com> When FUSE_GETXATTR will never return anything, call cache_no_acl to cache that state in the vfs as well as in fuse with fc->no_getxattr. The only code paths this affects are those that call fuse_get_acl, and caching a NULL or returning it immediately has exactly the same effect, so this should not affect anything. This keeps the vfs from wasting its time calling down into fuse when fuse isn't going to do anything, and it makes it clear when a NULL should be cached for optimal performance. Signed-off-by: "Eric W. Biederman" --- fs/fuse/xattr.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c index 3caac46b08b0..0520a4f47226 100644 --- a/fs/fuse/xattr.c +++ b/fs/fuse/xattr.c @@ -82,6 +82,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value, ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX); if (ret == -ENOSYS) { fc->no_getxattr = 1; + cache_no_acl(inode); ret = -EOPNOTSUPP; } return ret; -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:53:00 2018 From: ebiederm at xmission.com (Eric W.
Biederman) Date: Mon, 26 Feb 2018 17:53:00 -0600 Subject: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic. In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-5-ebiederm@xmission.com> Rename the fuse connection flag posix_acl to cached_posix_acl as that is what it actually means: fuse will cache and operate on the cached value of the posix acl. When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode so that get_acl and friends won't cache the acl values even if they are called. Replace forget_all_cached_acls with fuse_forget_cached_acls. This wrapper only takes effect when cached_posix_acl is true to prevent losing the nocache or noxattr status when posix acls are not cached. Always use posix_acl_access_xattr_handler so the fuse code benefits from the generic posix acl handlers as much as possible. This will become important as the code works on translation of uid and gid in the posix acls when fuse is not mounted in the initial user namespace. Signed-off-by: "Eric W.
Biederman" --- fs/fuse/acl.c | 6 +++--- fs/fuse/dir.c | 11 +++++------ fs/fuse/fuse_i.h | 5 +++-- fs/fuse/inode.c | 13 ++++++++++--- fs/fuse/xattr.c | 5 ----- 5 files changed, 21 insertions(+), 19 deletions(-) diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index ec85765502f1..8fb2153dbf50 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type) void *value = NULL; struct posix_acl *acl; - if (!fc->posix_acl || fc->no_getxattr) + if (fc->no_getxattr) return NULL; if (type == ACL_TYPE_ACCESS) @@ -53,7 +53,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) const char *name; int ret; - if (!fc->posix_acl || fc->no_setxattr) + if (fc->no_setxattr) return -EOPNOTSUPP; if (type == ACL_TYPE_ACCESS) @@ -92,7 +92,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) } else { ret = fuse_removexattr(inode, name); } - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); fuse_invalidate_attr(inode); return ret; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 24967382a7b1..a44ca509db4f 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -237,7 +237,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags) if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT) goto invalid; - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); fuse_change_attributes(inode, &outarg.attr, entry_attr_timeout(&outarg), attr_version); @@ -930,7 +930,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file, int err = 0; if (time_before64(fi->i_time, get_jiffies_64())) { - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); err = fuse_do_getattr(inode, stat, file); } else if (stat) { generic_fillattr(inode, stat); @@ -1076,7 +1076,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask) if (mask & MAY_NOT_BLOCK) return -ECHILD; - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); return 
fuse_do_getattr(inode, NULL, NULL); } @@ -1246,7 +1246,7 @@ static int fuse_direntplus_link(struct file *file, fi->nlookup++; spin_unlock(&fc->lock); - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); fuse_change_attributes(inode, &o->attr, entry_attr_timeout(o), attr_version); @@ -1764,8 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr) * If filesystem supports acls it may have updated acl xattrs in * the filesystem, so forget cached acls for the inode. */ - if (fc->posix_acl) - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); /* Directory mode changed, may need to revalidate access */ if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE)) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index c4c093bbf456..3cf296d60bc0 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -619,7 +619,7 @@ struct fuse_conn { unsigned no_lseek:1; /** Does the filesystem support posix acls? */ - unsigned posix_acl:1; + unsigned cached_posix_acl:1; /** Check permissions based on the file mode or not? */ unsigned default_permissions:1; @@ -913,6 +913,8 @@ void fuse_release_nowrite(struct inode *inode); u64 fuse_get_attr_version(struct fuse_conn *fc); +void fuse_forget_cached_acls(struct inode *inode); + /** * File-system tells the kernel to invalidate cache for the given node id. 
*/ @@ -974,7 +976,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value, ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size); int fuse_removexattr(struct inode *inode, const char *name); extern const struct xattr_handler *fuse_xattr_handlers[]; -extern const struct xattr_handler *fuse_acl_xattr_handlers[]; struct posix_acl; struct posix_acl *fuse_get_acl(struct inode *inode, int type); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 624f18bbfd2b..0c3ccca7c554 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -313,6 +313,8 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid, if (!fc->writeback_cache || !S_ISREG(attr->mode)) inode->i_flags |= S_NOCMTIME; inode->i_generation = generation; + if (!fc->cached_posix_acl) + inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE; fuse_init_inode(inode, attr); unlock_new_inode(inode); } else if ((inode->i_mode ^ attr->mode) & S_IFMT) { @@ -331,6 +333,12 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid, return inode; } +void fuse_forget_cached_acls(struct inode *inode) +{ + if (get_fuse_conn(inode)->cached_posix_acl) + forget_all_cached_acls(inode); +} + int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid, loff_t offset, loff_t len) { @@ -343,7 +351,7 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid, return -ENOENT; fuse_invalidate_attr(inode); - forget_all_cached_acls(inode); + fuse_forget_cached_acls(inode); if (offset >= 0) { pg_start = offset >> PAGE_SHIFT; if (len <= 0) @@ -915,8 +923,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req) fc->sb->s_time_gran = arg->time_gran; if ((arg->flags & FUSE_POSIX_ACL)) { fc->default_permissions = 1; - fc->posix_acl = 1; - fc->sb->s_xattr = fuse_acl_xattr_handlers; + fc->cached_posix_acl = 1; } } else { ra_pages = fc->max_read / PAGE_SIZE; diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c index 0520a4f47226..48a95e1bb020 100644 --- a/fs/fuse/xattr.c +++ 
b/fs/fuse/xattr.c @@ -200,11 +200,6 @@ static const struct xattr_handler fuse_xattr_handler = { }; const struct xattr_handler *fuse_xattr_handlers[] = { - &fuse_xattr_handler, - NULL -}; - -const struct xattr_handler *fuse_acl_xattr_handlers[] = { &posix_acl_access_xattr_handler, &posix_acl_default_xattr_handler, &fuse_xattr_handler, -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:53:01 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:53:01 -0600 Subject: [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-6-ebiederm@xmission.com> In order to support mounts from namespaces other than init_user_ns, fuse must translate uids and gids to/from the userns of the process servicing requests on /dev/fuse. This patch does that, with a couple of restrictions on the namespace: - The userns for the fuse connection is fixed to the namespace from which /dev/fuse is opened. - The namespace must be the same as s_user_ns. These restrictions simplify the implementation by avoiding the need to pass around userns references and by allowing fuse to rely on the checks in setattr_prepare for ownership changes. Either restriction could be relaxed in the future if needed. For cuse the userns used is the opener of /dev/cuse. Semantically the cuse support does not appear safe for unprivileged users. Practically the permissions on /dev/cuse only make it accessible to the global root user. If something slips through the cracks in a user namespace the only users who will be able to use the cuse device are those users mapped into the user namespace. Translation in the posix acl is updated to use the user namespace of the filesystem. Avoiding cases which might bypass this translation is handled in a following change. This change is strongly based on a similar change from Seth Forshee and Dongsu Park.
Cc: linux-fsdevel at vger.kernel.org Cc: linux-kernel at vger.kernel.org Cc: Miklos Szeredi Cc: Cc: Dongsu Park Signed-off-by: Eric W. Biederman --- fs/fuse/acl.c | 4 ++-- fs/fuse/cuse.c | 7 ++++++- fs/fuse/dev.c | 4 ++-- fs/fuse/dir.c | 14 +++++++------- fs/fuse/fuse_i.h | 6 +++++- fs/fuse/inode.c | 31 +++++++++++++++++++------------ 6 files changed, 41 insertions(+), 25 deletions(-) diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index 8fb2153dbf50..5a67c80e21d6 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type) return ERR_PTR(-ENOMEM); size = fuse_getxattr(inode, name, value, PAGE_SIZE); if (size > 0) - acl = posix_acl_from_xattr(&init_user_ns, value, size); + acl = posix_acl_from_xattr(fc->user_ns, value, size); else if ((size == 0) || (size == -ENODATA) || (size == -EOPNOTSUPP && fc->no_getxattr)) acl = NULL; @@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type) if (!value) return -ENOMEM; - ret = posix_acl_to_xattr(&init_user_ns, acl, value, size); + ret = posix_acl_to_xattr(fc->user_ns, acl, value, size); if (ret < 0) { kfree(value); return ret; diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c index e9e97803442a..036ee477669e 100644 --- a/fs/fuse/cuse.c +++ b/fs/fuse/cuse.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "fuse_i.h" @@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file) if (!cc) return -ENOMEM; - fuse_conn_init(&cc->fc); + /* + * Limit the cuse channel to requests that can + * be represented in file->f_cred->user_ns. 
+ */ + fuse_conn_init(&cc->fc, file->f_cred->user_ns); fud = fuse_dev_alloc(&cc->fc); if (!fud) { diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 2886a56d5f61..fce7915aea13 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req) static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { - req->in.h.uid = from_kuid(&init_user_ns, current_fsuid()); - req->in.h.gid = from_kgid(&init_user_ns, current_fsgid()); + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1)); diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index a44ca509db4f..79cca1687457 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, stat->ino = attr->ino; stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); stat->nlink = attr->nlink; - stat->uid = make_kuid(&init_user_ns, attr->uid); - stat->gid = make_kgid(&init_user_ns, attr->gid); + stat->uid = make_kuid(fc->user_ns, attr->uid); + stat->gid = make_kgid(fc->user_ns, attr->gid); stat->rdev = inode->i_rdev; stat->atime.tv_sec = attr->atime; stat->atime.tv_nsec = attr->atimensec; @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, - bool trust_local_cmtime) +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, + struct fuse_setattr_in *arg, bool trust_local_cmtime) { unsigned ivalid = iattr->ia_valid; if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; if (ivalid & ATTR_UID) - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid); + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, 
iattr->ia_uid); if (ivalid & ATTR_GID) - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid); + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid); if (ivalid & ATTR_SIZE) arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size; if (ivalid & ATTR_ATIME) { @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); - iattr_to_fattr(attr, &inarg, trust_local_cmtime); + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 3cf296d60bc0..eba0beea8634 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -26,6 +26,7 @@ #include #include #include +#include /** Max number of pages that can be used in a single read request */ #define FUSE_MAX_PAGES_PER_REQ 32 @@ -466,6 +467,9 @@ struct fuse_conn { /** The pid namespace for this mount */ struct pid_namespace *pid_ns; + /** The user namespace for this mount */ + struct user_namespace *user_ns; + /** Maximum read size */ unsigned max_read; @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc); /** * Initialize fuse_conn */ -void fuse_conn_init(struct fuse_conn *fc); +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns); /** * Release reference to fuse_conn diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 0c3ccca7c554..cd3d29610688 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, inode->i_ino = fuse_squash_ino(attr->ino); inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); set_nlink(inode, attr->nlink); - inode->i_uid = make_kuid(&init_user_ns, attr->uid); - inode->i_gid = make_kgid(&init_user_ns, attr->gid); + inode->i_uid = make_kuid(fc->user_ns, attr->uid); + inode->i_gid = make_kgid(fc->user_ns, 
attr->gid); inode->i_blocks = attr->blocks; inode->i_atime.tv_sec = attr->atime; inode->i_atime.tv_nsec = attr->atimensec; @@ -485,7 +485,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res) return err; } -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev, + struct user_namespace *user_ns) { char *p; memset(d, 0, sizeof(struct fuse_mount_data)); @@ -521,7 +522,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) case OPT_USER_ID: if (fuse_match_uint(&args[0], &uv)) return 0; - d->user_id = make_kuid(current_user_ns(), uv); + d->user_id = make_kuid(user_ns, uv); if (!uid_valid(d->user_id)) return 0; d->user_id_present = 1; @@ -530,7 +531,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev) case OPT_GROUP_ID: if (fuse_match_uint(&args[0], &uv)) return 0; - d->group_id = make_kgid(current_user_ns(), uv); + d->group_id = make_kgid(user_ns, uv); if (!gid_valid(d->group_id)) return 0; d->group_id_present = 1; @@ -573,8 +574,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) struct super_block *sb = root->d_sb; struct fuse_conn *fc = get_fuse_conn_super(sb); - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id)); - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id)); + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id)); + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id)); if (fc->default_permissions) seq_puts(m, ",default_permissions"); if (fc->allow_other) @@ -605,7 +606,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq) fpq->connected = 1; } -void fuse_conn_init(struct fuse_conn *fc) +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns) { memset(fc, 0, sizeof(*fc)); spin_lock_init(&fc->lock); @@ -629,6 +630,7 @@ void fuse_conn_init(struct fuse_conn *fc) 
fc->attr_version = 1; get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); fc->pid_ns = get_pid_ns(task_active_pid_ns(current)); + fc->user_ns = get_user_ns(user_ns); } EXPORT_SYMBOL_GPL(fuse_conn_init); @@ -638,6 +640,7 @@ void fuse_conn_put(struct fuse_conn *fc) if (fc->destroy_req) fuse_request_free(fc->destroy_req); put_pid_ns(fc->pid_ns); + put_user_ns(fc->user_ns); fc->release(fc); } } @@ -1068,7 +1071,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION); - if (!parse_fuse_opt(data, &d, is_bdev)) + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns)) goto err; if (is_bdev) { @@ -1093,8 +1096,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) if (!file) goto err; - if ((file->f_op != &fuse_dev_operations) || - (file->f_cred->user_ns != &init_user_ns)) + /* + * Require mount to happen from the same user namespace which + * opened /dev/fuse to prevent potential attacks. + */ + if (file->f_op != &fuse_dev_operations || + file->f_cred->user_ns != sb->s_user_ns) goto err_fput; fc = kmalloc(sizeof(*fc), GFP_KERNEL); @@ -1102,7 +1109,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent) if (!fc) goto err_fput; - fuse_conn_init(fc); + fuse_conn_init(fc, sb->s_user_ns); fc->release = fuse_free_conn; fud = fuse_dev_alloc(fc); -- 2.14.1 From ebiederm at xmission.com Mon Feb 26 23:53:02 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 17:53:02 -0600 Subject: [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant In-Reply-To: <87po4rz4ui.fsf_-_@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> Message-ID: <20180226235302.12708-7-ebiederm@xmission.com> From: Seth Forshee Unprivileged users are normally restricted from mounting with the allow_other option by system policy, but this could be bypassed for a mount done with user namespace root permissions. 
In such cases allow_other should not allow users outside the userns to access the mount as doing so would give the unprivileged user the ability to manipulate processes it would otherwise be unable to manipulate. Restrict allow_other to apply to users in the same userns used at mount or a descendant of that namespace. Also export current_in_userns() for use by fuse when built as a module. Cc: linux-fsdevel at vger.kernel.org Cc: linux-kernel at vger.kernel.org Cc: "Eric W. Biederman" Cc: Serge Hallyn Cc: Miklos Szeredi Acked-by: Miklos Szeredi Reviewed-by: Serge Hallyn Reviewed-by: "Eric W. Biederman" Signed-off-by: Seth Forshee Signed-off-by: Dongsu Park Signed-off-by: Eric W. Biederman --- fs/fuse/dir.c | 2 +- kernel/user_namespace.c | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 79cca1687457..0cbd1ff3dd48 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc) const struct cred *cred; if (fc->allow_other) - return 1; + return current_in_userns(fc->user_ns); cred = current_cred(); if (uid_eq(cred->euid, fc->user_id) && diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 246d4d4ce5c7..492c255e6c5a 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns) { return in_userns(target_ns, current_user_ns()); } +EXPORT_SYMBOL(current_in_userns); static inline struct user_namespace *to_user_ns(struct ns_common *ns) { -- 2.14.1 From sargun at sargun.me Tue Feb 27 00:01:13 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 16:01:13 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <20180226230418.46nczgkh5csakyu7@ast-mbp> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov wrote: > 
On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >> This patchset enables seccomp filters to be written in eBPF. Although, this >> patchset doesn't introduce much of the functionality enabled by eBPF, it lays >> the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE >> support in order to utilize eBPF seccomp filters, as eBPF filters cannot be >> retrieved via the ptrace GET_FILTER API. > > this was discussed multiple times in the past. > In eBPF land it's practically impossible to do checkpoint/restore > of the whole bpf program/map graph. > >> Any user can load a bpf seccomp filter program, and it can be pinned and >> reused without requiring access to the bpf syscalls. A user only requires >> the traditional permissions of either being cap_sys_admin, or have >> no_new_privs set in order to install their rule. >> >> The primary reason for not adding maps support in this patchset is >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >> If we have a map that the BPF program can read, it can potentially >> "change" privileges after running. It seems like doing writes only >> is safe, because it can be pure, and side effect free, and therefore >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >> to an agreement, this can be in a follow-up patchset. > > readonly maps already exist. See BPF_F_RDONLY. > Is that not enough? > With BPF_F_RDONLY, is there a mechanism to populate a prog_array, and then mark it rd_only? 
>> A benchmark of this patchset is as follows for a very standard eBPF filter: >> >> Given this test program: >> for (i = 10; i < 99999999; i++) syscall(__NR_getpid); >> >> If I implement an eBPF filter with PROG_ARRAYs with a program per syscall, >> and tail call, the numbers are such: >> ebpf JIT 12.3% slower than native >> ebpf no JIT 13.6% slower than native >> seccomp JIT 17.6% slower than native >> seccomp no JIT 37% slower than native > > the perf gains are misleading, since patches don't enable bpf_tail_call. > > The main statement I want to hear from seccomp maintainers before > proceeding any further on this that enabling eBPF in seccomp won't lead > to seccomp folks arguing against changes in bpf core (like verifier) > just because it's used by seccomp. > It must be spelled out in the commit log with explicit Ack. > From keescook at chromium.org Tue Feb 27 00:49:21 2018 From: keescook at chromium.org (Kees Cook) Date: Mon, 26 Feb 2018 16:49:21 -0800 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180214152958.cjgwh2k52zji2jxk@cisco> Message-ID: On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski wrote: > On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen wrote: >> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: >>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: >>> I wonder if this communication should be netlink, which gives a more >>> well-structured way to describe what's on the wire? The reason I ask >>> is because if we ever change the seccomp_data structure, we'll now >>> have two places where we need to deal with it (the first being within >>> the BPF itself). My initial idea was to prefix the communication with >>> a size field, then send the structure, and then I had nightmares, and >>> realized this was basically netlink reinvented. 
>> >> I suggested netlink in LA, and everyone (especially Andy) groaned very >> loudly :). I'm happy to switch it to netlink if you like, although i >> think memcpy() of structs should be safe here, since the return value >> from read or write can indicate the size of things. > > I could easily get on board with "netlink" (i.e. NLA) messages sent > over an fd. I will object strongly to the use of netlink *sockets*. Yeah, I was thinking NLA over the fd; not a netlink socket. >>> An ERRNO filter would block a USER_NOTIF because it's unconditional. >>> TRACE could be either, USER_NOTIF could be either. >>> >>> This means TRACE rules would be bumped by a USER_NOTIF... hmm. >> >> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all >> seemed more important than USER_NOTIF, but TRACE didn't. I don't have >> a strong opinion about what to do here, because users can adjust their >> filters accordingly. Let me know what you prefer. > > If we switched to eBPF functions, this whole issue goes away. Yeah, though we'd still need some kind of "wait for answer" eBPF function. It feels wrong to re-use maps for that... -Kees -- Kees Cook Pixel Security From tycho at tycho.ws Tue Feb 27 00:54:46 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Mon, 26 Feb 2018 17:54:46 -0700 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> References: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> Message-ID: <20180227005446.cmwsmh3fz4vhimmt@smitten> On Mon, Feb 26, 2018 at 07:27:05AM +0000, Sargun Dhillon wrote: > +config SECCOMP_FILTER_EXTENDED > + bool "Extended BPF seccomp filters" > + depends on SECCOMP_FILTER && BPF_SYSCALL > + depends on !CHECKPOINT_RESTORE Why not just give -EINVAL or something in case one of these is requested, instead of making them incompatible at compile time? 
Tycho From tycho at tycho.ws Tue Feb 27 01:01:53 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Mon, 26 Feb 2018 18:01:53 -0700 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: <20180227010153.leq7u45z57ypip2z@smitten> On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote: > On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov > wrote: > > On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: > >> This patchset enables seccomp filters to be written in eBPF. Although, this > >> [...] > > The main statement I want to hear from seccomp maintainers before > > proceeding any further on this that enabling eBPF in seccomp won't lead > > to seccomp folks arguing against changes in bpf core (like verifier) > > just because it's used by seccomp. > > It must be spelled out in the commit log with explicit Ack. > > The primary thing I'm concerned about with eBPF and seccomp is > side-effects from eBPF programs running at syscall time. This is an > extremely sensitive area, and I want to be sure there won't be > feature-creep here that leads to seccomp getting into a bad state. > > As long as seccomp can continue have its own verifier, I guess these patches should introduce some additional restrictions in kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF verifier allows. > I *think* this will be fine, though, again I remain concerned about > maps, etc. I'm still reviewing these patches and how they might > provide overlap with Tycho's needs too, etc. Yes, it's on my TODO list to take a look at how to do it as suggested by Alexei on top of this set before posting a v2. Haven't had time recently, though.
Cheers, Tycho From torvalds at linux-foundation.org Tue Feb 27 01:13:59 2018 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Mon, 26 Feb 2018 17:13:59 -0800 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: <20180226235302.12708-3-ebiederm@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-3-ebiederm@xmission.com> Message-ID: On Mon, Feb 26, 2018 at 3:52 PM, Eric W. Biederman wrote: > > Additionaly update the comment above the call to get_acl itself and > remove the wrong information that an implementation of get_acl can > prevent caching by calling forget_cached_acl. This part is just confusing. First off, that comment is correct: a filesystem _can_ prevent the returning of cached data by just calling forget_cached_acl(). Note that there are two different cases: saying that you _never_ want to cache things (ACL_DONT_CACHE) and saying that there _currently_ is no cached data (ACL_NOT_CACHED). forget_cached_acl() just removes the current cache. You're just replacing one case of "no cached" information with the other. Just explain the two cases, don't try to muddy the waters even more.. PLUS you are just confusing things entirely. That whole new comment of yours: + * ACL_DONT_CACHE is treated as another task updating the acl and + * remains set. is just garbage. The code is very clear - it will only replace a ACL_NOT_CACHED entry. The code is clear: if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED) /* fall through */ ; this is basically just an atomic "if *p == ACL_NOT_CACHED then replace it with 'sentinel'". Your comment does not add any clarity at all, and only confuses things. It has nothing to do with "treated as another task updating the acl". The fact is, ACL_DONT_CACHE is treated as if the cache is simply already filled - it's just filled with "no cache". So the only thing special is ACL_NOT_CACHED, which is the only thing we will try to _replace_. 
So NAK on this patch entirely. It's just adding confusion, not adding clarifications. Linus From ebiederm at xmission.com Tue Feb 27 02:53:02 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 20:53:02 -0600 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: (Linus Torvalds's message of "Mon, 26 Feb 2018 17:13:59 -0800") References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-3-ebiederm@xmission.com> Message-ID: <87r2p7rvn5.fsf@xmission.com> So the purpose for having a patch in the first place is that 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer") which added ACL_DONT_CACHE did not result in any comment updates to get_acl. Which means that if you read the comments in get_acl() that you don't even think of ACL_DONT_CACHE. Which means that this comment: /* * If the ACL isn't being read yet, set our sentinel. Otherwise, the * current value of the ACL will not be ACL_NOT_CACHED and so our own * sentinel will not be set; another task will update the cache. We * could wait for that other task to complete its job, but it's easier * to just call ->get_acl to fetch the ACL ourself. (This is going to * be an unlikely race.) */ presumes the only reason the acl could be anything other than ACL_NOT_CACHED is because get_acl() is already being called upon it in another task. I wanted something to mention ACL_DONT_CACHE so someone would at least think about that case if they ever step up to modify the code. The code is perfectly clear, the comment is not. That scares me. And I had to read the code about a dozen times before I realized the ACL_DONT_CACHE case even exists. Not useful when I need to use that to preserve historical fuse semantics. So something is missing here even if my wording does not improve things. Then we get this comment: /* * Normally, the ACL returned by ->get_acl will be cached.
* A filesystem can prevent that by calling * forget_cached_acl(inode, type) in ->get_acl. */ Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes") That comment is and always has been rubbish. I don't have a clue what it is trying to say but it is not something a person can use to write filesystem code with. Truths: - forget_cached_acl(inode, type) can be used to invalidate the acl cache. - Calling forget_cached_acl from within the filesystem's ->get_acl method won't prevent a cached value from being returned because ->get_acl will be set. - Calling forget_cached_acl from within the filesystem's ->get_acl method won't prevent a returned value from being cached because the caching happens after ->get_acl returns. - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent a value from ->get_acl from being cached. In summary I only care about two things. 1) ACL_NOT_CACHED being mentioned somewhere in get_acl so people looking at the code, and people updating the code will have a hint that they need to consider that case. 2) That misleading completely bogus comment being removed/fixed. And yes I agree the code is clear. The comments are not. Does this look better as a comment updating patch? diff --git a/fs/posix_acl.c b/fs/posix_acl.c index 2fd0fde16fe1..5453094b8828 100644 --- a/fs/posix_acl.c +++ b/fs/posix_acl.c @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type) struct posix_acl **p; struct posix_acl *acl; + /* + * To avoid caching the result of ->get_acl + * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE; + */ + /* * The sentinel is used to detect when another operation like * set_cached_acl() or forget_cached_acl() races with get_acl(). @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type) /* fall through */ ; /* - * Normally, the ACL returned by ->get_acl will be cached. - * A filesystem can prevent that by calling - * forget_cached_acl(inode, type) in ->get_acl.
+ * The ACL returned by ->get_acl will be cached. * * If the filesystem doesn't have a get_acl() function at all, we'll * just create the negative cache entry. Eric From ebiederm at xmission.com Tue Feb 27 03:14:52 2018 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 26 Feb 2018 21:14:52 -0600 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: <87r2p7rvn5.fsf@xmission.com> (Eric W. Biederman's message of "Mon, 26 Feb 2018 20:53:02 -0600") References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-3-ebiederm@xmission.com> <87r2p7rvn5.fsf@xmission.com> Message-ID: <87tvu3qg2b.fsf@xmission.com> ebiederm at xmission.com (Eric W. Biederman) writes: > So the purpose for having a patch in the first place is that > 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer") > which added ACL_DONT_CACHE did not result in any comment updates > to get_acl. > > Which means that if you read the comments in get_acl() you > don't even think of ACL_DONT_CACHE. > > Which means that this comment: > /* > * If the ACL isn't being read yet, set our sentinel. Otherwise, the > * current value of the ACL will not be ACL_NOT_CACHED and so our own > * sentinel will not be set; another task will update the cache. We > * could wait for that other task to complete its job, but it's easier > * to just call ->get_acl to fetch the ACL ourself. (This is going to > * be an unlikely race.) > */ > > presumes the only reason the acl could be anything other than > ACL_NOT_CACHED is because get_acl() is already being called upon it in > another task. > > I wanted something to mention ACL_DONT_CACHE so someone would at least > think about that case if they ever step up to modify the code. > > The code is perfectly clear, the comment is not. That scares me. > > And I had to read the code about a dozen times before I realized the > ACL_DONT_CACHE case even exists.
Not useful when I need to use > that to preserve historical fuse semantics. > > So something is missing here even if my wording does not improve things. > > > > Then we get this comment: > /* > * Normally, the ACL returned by ->get_acl will be cached. > * A filesystem can prevent that by calling > * forget_cached_acl(inode, type) in ->get_acl. > */ > > Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes") > That comment is and always has been rubbish. > > I don't have a clue what it is trying to say but it is not something > a person can use to write filesystem code with. > > > Truths: > - forget_cached_acl(inode, type) can be used to invalidate the acl > cache. > > - Calling forget_cached_acl from within the filesystem's ->get_acl > method won't prevent a cached value from being returned because > ->get_acl will be set. > > - Calling forget_cached_acl from within the filesystem's ->get_acl > method won't prevent a returned value from being cached > because the caching happens after ->get_acl returns. Sigh. Yes it will because we set the special sentinel value, and forget_cached_acl will replace the sentinel value with ACL_NOT_CACHED. It is a terribly brittle and racy thing to do, and it probably won't work to say cache this acl but not this one on a case by case basis in ->get_acl. As such I believe that usage of forget_cached_acl should be subsumed by using ACL_DONT_CACHE. If not we should really come up with a different helper function name to call from ->get_acl. Preferably one that does "cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races. > - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent > a value from ->get_acl from being cached. > > > In summary I only care about two things. > 1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking > at the code, and people updating the code will have a hint that they > need to consider that case.
> > 2) That misleading completely bogus comment being removed/fixed. > > > And yes I agree the code is clear. The comments are not. > > > Does this look better as a comment updating patch? > > diff --git a/fs/posix_acl.c b/fs/posix_acl.c > index 2fd0fde16fe1..5453094b8828 100644 > --- a/fs/posix_acl.c > +++ b/fs/posix_acl.c > @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type) > struct posix_acl **p; > struct posix_acl *acl; > > + /* > + * To avoid caching the result of ->get_acl > + * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE; > + */ > + > /* > * The sentinel is used to detect when another operation like > * set_cached_acl() or forget_cached_acl() races with get_acl(). > @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type) > /* fall through */ ; > > /* > - * Normally, the ACL returned by ->get_acl will be cached. > - * A filesystem can prevent that by calling > - * forget_cached_acl(inode, type) in ->get_acl. > + * The ACL returned by ->get_acl will be cached. > * > * If the filesystem doesn't have a get_acl() function at all, we'll > * just create the negative cache entry. 
> > Eric From luto at amacapital.net Tue Feb 27 03:27:51 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Mon, 26 Feb 2018 19:27:51 -0800 Subject: [RFC 1/3] seccomp: add a return code to trap to userspace In-Reply-To: References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> <20180214152958.cjgwh2k52zji2jxk@cisco> Message-ID: <6314EDD9-2B0F-454C-9B99-E57694DC7AE1@amacapital.net> > On Feb 26, 2018, at 4:49 PM, Kees Cook wrote: > >> On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski wrote: >>> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen wrote: >>>> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: >>>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: >>>> I wonder if this communication should be netlink, which gives a more >>>> well-structured way to describe what's on the wire? The reason I ask >>>> is because if we ever change the seccomp_data structure, we'll now >>>> have two places where we need to deal with it (the first being within >>>> the BPF itself). My initial idea was to prefix the communication with >>>> a size field, then send the structure, and then I had nightmares, and >>>> realized this was basically netlink reinvented. >>> >>> I suggested netlink in LA, and everyone (especially Andy) groaned very >>> loudly :). I'm happy to switch it to netlink if you like, although i >>> think memcpy() of structs should be safe here, since the return value >>> from read or write can indicate the size of things. >> >> I could easily get on board with "netlink" (i.e. NLA) messages sent >> over an fd. I will object strongly to the use of netlink *sockets*. > > Yeah, I was thinking NLA over the fd; not a netlink socket. > >>>> An ERRNO filter would block a USER_NOTIF because it's unconditional. >>>> TRACE could be either, USER_NOTIF could be either. >>>> >>>> This means TRACE rules would be bumped by a USER_NOTIF... hmm. >>> >>> Yes, I didn't exactly know what to do here. 
ERRNO, TRAP, and KILL all >>> seemed more important than USER_NOTIF, but TRACE didn't. I don't have >>> a strong opinion about what to do here, because users can adjust their >>> filters accordingly. Let me know what you prefer. >> >> If we switched to eBPF functions, this whole issue goes away. > > Yeah, though we'd still need some kind of "wait for answer" eBPF > function. It feels wrong to re-use maps for that... > BPF_CALL. Alexei, can we make it so that each bpf program type can easily limit which BPF_CALL helpers can be used and allow bpf program types to add their own helpers? From torvalds at linux-foundation.org Tue Feb 27 03:36:48 2018 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Mon, 26 Feb 2018 19:36:48 -0800 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: <87r2p7rvn5.fsf@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-3-ebiederm@xmission.com> <87r2p7rvn5.fsf@xmission.com> Message-ID: On Mon, Feb 26, 2018 at 6:53 PM, Eric W. Biederman wrote: > > So the purpose for having a patch in the first place is that > 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer") > which added ACL_DONT_CACHE did not result in any comment updates > to get_acl. I'm not opposed to just updating the comments. I just think your updates were somewhat misleading. > Which means that if you read the comments in get_acl() you > don't even think of ACL_DONT_CACHE. Right. By all means add a comment about ACL_DONT_CACHE disabling the cache entirely. But don't _remove_ the other valid way to flush the cache, and don't make that comment above cmpxchg() be even more confusing than the code is. > Does this look better as a comment updating patch?
> > diff --git a/fs/posix_acl.c b/fs/posix_acl.c > index 2fd0fde16fe1..5453094b8828 100644 > --- a/fs/posix_acl.c > +++ b/fs/posix_acl.c > @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type) > struct posix_acl **p; > struct posix_acl *acl; > > + /* > + * To avoid caching the result of ->get_acl > + * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE; > + */ > + > /* > * The sentinel is used to detect when another operation like > * set_cached_acl() or forget_cached_acl() races with get_acl(). > @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type) > /* fall through */ ; > > /* > - * Normally, the ACL returned by ->get_acl will be cached. > - * A filesystem can prevent that by calling > - * forget_cached_acl(inode, type) in ->get_acl. > + * The ACL returned by ->get_acl will be cached. Why do you hate forget_cached_acl()? It's perfectly valid too. Don't remove that comment. Maybe reword it to talk not about "preventing", but about "invalidating the cache". But the old comment that you remove isn't _wrong_, it's just that the "preventing" from returning the cached state with forget_cached_acl() is just a one-time thing. So forget_cached_acl() exists, and it works, and it does exactly what its name says. It is a perfectly valid way to prevent the current entry from being used in the future. See? I object to you removing that, and trying to make it be like ACL_DONT_CACHE is the *only* way to not cache something. Because honestly, that's what your comment updates do. They take the comments about _one_ case, and switch it over to be about the _other_ case. But dammit, there are _two_ ways to not cache things. "Fixing" the comment to talk about one and removing the other isn't a fix. It's just a stupid change that now has the problem the other way around! So fix the comment to really just talk about both things. First: talk about how to avoid caching entirely (ACL_DONT_CACHE).
Then, talk about how to invalidate the cache once it has been instantiated (forget_cached_acl()). Don't do this idiotic "remove the valid comment just because you happened to care about the _other_ case" Linus From torvalds at linux-foundation.org Tue Feb 27 03:41:21 2018 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Mon, 26 Feb 2018 19:41:21 -0800 Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE In-Reply-To: <87tvu3qg2b.fsf@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-3-ebiederm@xmission.com> <87r2p7rvn5.fsf@xmission.com> <87tvu3qg2b.fsf@xmission.com> Message-ID: On Mon, Feb 26, 2018 at 7:14 PM, Eric W. Biederman wrote: > > As such I believe that usage of forget_cached_acl should be subsumed by > using ACL_NOT_CACHED. If not we should really come up with a different > helper function name to call from ->get_acl. Preferably one that does > "cmpxchng(p, sentinel, ACL_NOT_CACHED)" so that we remove the races. You make your bias very clear, by simply trying to hide the other case. But for chrissake, that's not the state right now. That other case exists. You can't - and shouldn't - try to just hide it. Besides, that "forget_cached_acl()" approach actually has a valid use case. Maybe you _do_ want to cache ACL's, but with a timeout or revalidation. ACL_DONT_CACHE really is a big hammer that makes caching not work at all. It's not necessarily the right thing to do at all. 
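A minimal sketch of the two mechanisms argued about in this thread, assuming a simplified single-threaded userspace model of the cache slot that get_acl() works on. Only the ACL_NOT_CACHED and ACL_DONT_CACHE names come from the thread; model_get_acl, model_forget, cmpxchg_slot, SENTINEL, fetch_fn and the two fetch callbacks are hypothetical, and the real code's atomic cmpxchg(), per-task sentinel and reference counting are left out.

```c
#include <stdint.h>

/* Userspace model only; names other than the ACL_* constants are hypothetical. */
struct acl { int id; };

/* Special slot values are odd, which an aligned ACL pointer never is. */
#define ACL_NOT_CACHED ((struct acl *)(intptr_t)-1)
#define ACL_DONT_CACHE ((struct acl *)(intptr_t)-3)
#define SENTINEL       ((struct acl *)(intptr_t)-5) /* stand-in for the per-task sentinel */

typedef struct acl *(*fetch_fn)(struct acl **slot);

static struct acl the_acl = { 42 };
static int fetch_calls;

static int is_uncached(const struct acl *a)
{
	return ((intptr_t)a & 1) != 0;
}

/* Single-threaded stand-in for cmpxchg(): returns the value seen in *p. */
static struct acl *cmpxchg_slot(struct acl **p, struct acl *old, struct acl *new_val)
{
	struct acl *cur = *p;

	if (cur == old)
		*p = new_val;
	return cur;
}

/* Models forget_cached_acl() as an unconditional store of ACL_NOT_CACHED. */
static void model_forget(struct acl **slot)
{
	*slot = ACL_NOT_CACHED;
}

static struct acl *model_get_acl(struct acl **slot, fetch_fn fetch)
{
	struct acl *acl = *slot;

	if (!is_uncached(acl))
		return acl;	/* cache hit; NULL would be a cached negative entry */

	/* Claims the slot only if it is ACL_NOT_CACHED; ACL_DONT_CACHE is left alone. */
	cmpxchg_slot(slot, ACL_NOT_CACHED, SENTINEL);

	acl = fetch(slot);	/* stands in for inode->i_op->get_acl() */

	/* Cache the result only if our sentinel is still in place. */
	cmpxchg_slot(slot, SENTINEL, acl);
	return acl;
}

/* A ->get_acl stand-in that just returns a value. */
static struct acl *fetch_plain(struct acl **slot)
{
	(void)slot;
	fetch_calls++;
	return &the_acl;
}

/* A ->get_acl stand-in that also invalidates the slot while it runs. */
static struct acl *fetch_and_forget(struct acl **slot)
{
	fetch_calls++;
	model_forget(slot);	/* drops the sentinel, so this result is not cached */
	return &the_acl;
}
```

In this model a slot that starts as ACL_NOT_CACHED is filled by the first model_get_acl(&slot, fetch_plain) call and served from the cache afterwards, a slot set to ACL_DONT_CACHE is never filled, and fetch_and_forget suppresses caching for that one lookup only. If forget_cached_acl() really does store ACL_NOT_CACHED unconditionally, as modelled here, then calling it on an ACL_DONT_CACHE slot would silently re-enable caching.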
Linus From sargun at sargun.me Tue Feb 27 03:46:19 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 19:46:19 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <20180227010153.leq7u45z57ypip2z@smitten> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> <20180227010153.leq7u45z57ypip2z@smitten> Message-ID: On Mon, Feb 26, 2018 at 5:01 PM, Tycho Andersen wrote: > On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote: >> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov >> wrote: >> > On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >> >> This patchset enables seccomp filters to be written in eBPF. Although, this >> >> [...] >> > The main statement I want to hear from seccomp maintainers before >> > proceeding any further on this that enabling eBPF in seccomp won't lead >> > to seccomp folks arguing against changes in bpf core (like verifier) >> > just because it's used by seccomp. >> > It must be spelled out in the commit log with explicit Ack. >> >> The primary thing I'm concerned about with eBPF and seccomp is >> side-effects from eBPF programs running at syscall time. This is an >> extremely sensitive area, and I want to be sure there won't be >> feature-creep here that leads to seccomp getting into a bad state. >> >> As long as seccomp can continue have its own verifier, > > I guess these patches should introduce some additional restrictions in > kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF > verifier allows. > Like what? The helpers allowed are listed in seccomp.c. You have the same restrictions as the traditional eBPF verifier (no unsafe memory access, jumps backwards, etc..). I'm not sure which built-in eBPF functionality presents risk. >> I *think* this will be fine, though, again I remain concerned about >> maps, etc. 
I'm still reviewing these patches and how they might >> provide overlap with Tycho's needs too, etc. > > Yes, it's on my TODO list to take a look at how to do it as suggested > by Alexei on top of this set before posting a v2. Haven't had time > recently, though. > > Cheers, > > Tycho There's a lot of interest (in general) in having a mechanism to do notifications to userspace processes from eBPF for a variety of use cases. I think that this would be valuable for more than just seccomp, if it's implemented in a general-purpose manner. From sargun at sargun.me Tue Feb 27 03:49:48 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 19:49:48 -0800 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <20180227005446.cmwsmh3fz4vhimmt@smitten> References: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> <20180227005446.cmwsmh3fz4vhimmt@smitten> Message-ID: On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen wrote: > On Mon, Feb 26, 2018 at 07:27:05AM +0000, Sargun Dhillon wrote: >> +config SECCOMP_FILTER_EXTENDED >> + bool "Extended BPF seccomp filters" >> + depends on SECCOMP_FILTER && BPF_SYSCALL >> + depends on !CHECKPOINT_RESTORE > > Why not just give -EINVAL or something in case one of these is > requested, instead of making them incompatible at compile time? > > Tycho There's already code to return -EMEDIUMTYPE if it's a non-classic, or non-saved filter. Under the normal case, with CHECKPOINT_RESTORE enabled, you should never be able to get that. I think it makes sense to preserve this behaviour. My rough plan is to introduce a mechanism to dump filters like you can cBPF filters. If you look at my v1, there was a patch that did this. Once this gets in, I can prepare that patch, and we can lift this restriction.
From tycho at tycho.ws Tue Feb 27 03:57:46 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Mon, 26 Feb 2018 20:57:46 -0700 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: References: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> <20180227005446.cmwsmh3fz4vhimmt@smitten> Message-ID: <20180227035746.vh5mw7ijbyg3mbq3@cisco> On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote: > On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen wrote: > > On Mon, Feb 26, 2018 at 07:27:05AM +0000, Sargun Dhillon wrote: > >> +config SECCOMP_FILTER_EXTENDED > >> + bool "Extended BPF seccomp filters" > >> + depends on SECCOMP_FILTER && BPF_SYSCALL > >> + depends on !CHECKPOINT_RESTORE > > > > Why not just give -EINVAL or something in case one of these is > > requested, instead of making them incompatible at compile time? > > > > Tycho > There's already code to return -EMEDIUMTYPE if it's a non-classic, or > non-saved filter. Under the normal case, with CHECKPOINT_RESTORE > enabled, you should never be able to get that. I think it makes sense > to preserve this behaviour. Oh, right. So can't we just drop this, and the existing code will DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't supported, until they are? 
Tycho From tycho at tycho.ws Tue Feb 27 04:01:49 2018 From: tycho at tycho.ws (Tycho Andersen) Date: Mon, 26 Feb 2018 21:01:49 -0700 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> <20180227010153.leq7u45z57ypip2z@smitten> Message-ID: <20180227040149.3br32qpbrxt2pd5h@cisco> On Mon, Feb 26, 2018 at 07:46:19PM -0800, Sargun Dhillon wrote: > On Mon, Feb 26, 2018 at 5:01 PM, Tycho Andersen wrote: > > On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote: > >> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov > >> wrote: > >> > On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: > >> >> This patchset enables seccomp filters to be written in eBPF. Although, this > >> >> [...] > >> > The main statement I want to hear from seccomp maintainers before > >> > proceeding any further on this that enabling eBPF in seccomp won't lead > >> > to seccomp folks arguing against changes in bpf core (like verifier) > >> > just because it's used by seccomp. > >> > It must be spelled out in the commit log with explicit Ack. > >> > >> The primary thing I'm concerned about with eBPF and seccomp is > >> side-effects from eBPF programs running at syscall time. This is an > >> extremely sensitive area, and I want to be sure there won't be > >> feature-creep here that leads to seccomp getting into a bad state. > >> > >> As long as seccomp can continue have its own verifier, > > > > I guess these patches should introduce some additional restrictions in > > kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF > > verifier allows. > > > Like what? The helpers allowed are listed in seccomp.c. You have the > same restrictions as the traditional eBPF verifier (no unsafe memory > access, jumps backwards, etc..). I'm not sure which built-in eBPF > functionality presents risk. 
I think that's the $64,000 question that Kees is trying to answer r.e. maps, etc. There's also the possibility that eBPF grows something new that's unsafe for seccomp. Cheers, Tycho From sargun at sargun.me Tue Feb 27 04:08:16 2018 From: sargun at sargun.me (Sargun Dhillon) Date: Mon, 26 Feb 2018 20:08:16 -0800 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: <20180227035746.vh5mw7ijbyg3mbq3@cisco> References: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> <20180227005446.cmwsmh3fz4vhimmt@smitten> <20180227035746.vh5mw7ijbyg3mbq3@cisco> Message-ID: On Mon, Feb 26, 2018 at 7:57 PM, Tycho Andersen wrote: > On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote: >> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen wrote: >> > On Mon, Feb 26, 2018 at 07:27:05AM +0000, Sargun Dhillon wrote: >> >> +config SECCOMP_FILTER_EXTENDED >> >> + bool "Extended BPF seccomp filters" >> >> + depends on SECCOMP_FILTER && BPF_SYSCALL >> >> + depends on !CHECKPOINT_RESTORE >> > >> > Why not just give -EINVAL or something in case one of these is >> > requested, instead of making them incompatible at compile time? >> > >> > Tycho >> There's already code to return -EMEDIUMTYPE if it's a non-classic, or >> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE >> enabled, you should never be able to get that. I think it makes sense >> to preserve this behaviour. > > Oh, right. So can't we just drop this, and the existing code will > DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't > supported, until they are? > > Tycho My suggestion is we merge this as is, so we don't break checkpoint / restore, and I will try to get the filter dumping patching in the same development cycle as it comes at minimal risk. Otherwise, we risk introducing a feature which could break checkpoint/restore, even in unprivileged containers since anyone can load a BPF Seccomp filter. 
From luto at amacapital.net Tue Feb 27 04:19:49 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Tue, 27 Feb 2018 04:19:49 +0000 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: > On Feb 26, 2018, at 3:20 PM, Kees Cook wrote: > > On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov > wrote: >>> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>> This patchset enables seccomp filters to be written in eBPF. Although, this >>> [...] >> The main statement I want to hear from seccomp maintainers before >> proceeding any further on this that enabling eBPF in seccomp won't lead >> to seccomp folks arguing against changes in bpf core (like verifier) >> just because it's used by seccomp. >> It must be spelled out in the commit log with explicit Ack. > > The primary thing I'm concerned about with eBPF and seccomp is > side-effects from eBPF programs running at syscall time. This is an > extremely sensitive area, and I want to be sure there won't be > feature-creep here that leads to seccomp getting into a bad state. > > As long as seccomp can continue have its own verifier, I *think* this > will be fine, though, again I remain concerned about maps, etc. I'm > still reviewing these patches and how they might provide overlap with > Tycho's needs too, etc. I'm not sure I see this as a huge problem. As far as I can see, there are three ways that a verifier change could be problematic: 1. Addition of a new type of map. But seccomp would just not allow new map types by default, right? 2. Addition of a new BPF_CALLable helper. Seccomp wants a way to whitelist BPF_CALL targets. That should be straightforward. 3. Straight-up bugs. Those are exactly as problematic as verifier bugs in any other unprivileged eBPF program type, right? I don't see why seccomp is special here. 
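Points 1 and 2 above amount to a default-deny allow list keyed by program type. A minimal sketch in userspace C, assuming invented identifiers: the enum names, the table contents and helper_is_allowed() are hypothetical and say nothing about which helpers or map types the patch set actually permits.

```c
#include <stdbool.h>

/* Hypothetical IDs for illustration; these are not the kernel's enum values. */
enum helper_id {
	HELPER_KTIME_GET_NS,
	HELPER_GET_CURRENT_UID_GID,
	HELPER_MAP_LOOKUP_ELEM,
	HELPER_TAIL_CALL,
	HELPER_ID_MAX
};

enum prog_type {
	PROG_SECCOMP,
	PROG_SOCKET_FILTER,
	PROG_TYPE_MAX
};

/*
 * One row per program type. Entries not listed are zero-initialised,
 * so anything not explicitly opted in is denied.
 */
static const bool helper_allowed[PROG_TYPE_MAX][HELPER_ID_MAX] = {
	[PROG_SECCOMP] = {
		[HELPER_KTIME_GET_NS] = true,
		[HELPER_GET_CURRENT_UID_GID] = true,
	},
	[PROG_SOCKET_FILTER] = {
		[HELPER_KTIME_GET_NS] = true,
		[HELPER_MAP_LOOKUP_ELEM] = true,
		[HELPER_TAIL_CALL] = true,
	},
};

/* Returns true only for helpers explicitly listed for this program type. */
static bool helper_is_allowed(int type, int id)
{
	if (type < 0 || type >= PROG_TYPE_MAX)
		return false;
	if (id < 0 || id >= HELPER_ID_MAX)
		return false;	/* unknown or newly added helper: denied by default */
	return helper_allowed[type][id];
}
```

Because the table is zero-initialised, a helper added to the enum later stays rejected for the seccomp row until someone explicitly opts it in, which is the property being asked for here. The same shape would work for map types. As far as I know the kernel verifier already consults a per-program-type get_func_proto callback for this purpose, so the sketch is only meant to show the default-deny idea, not a new mechanism.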
From keescook at chromium.org Tue Feb 27 04:31:12 2018 From: keescook at chromium.org (Kees Cook) Date: Mon, 26 Feb 2018 20:31:12 -0800 Subject: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities In-Reply-To: References: <20180226072702.GA27057@ircssh-2.c.rugged-nimbus-611.internal> <20180227005446.cmwsmh3fz4vhimmt@smitten> <20180227035746.vh5mw7ijbyg3mbq3@cisco> Message-ID: On Mon, Feb 26, 2018 at 8:08 PM, Sargun Dhillon wrote: > On Mon, Feb 26, 2018 at 7:57 PM, Tycho Andersen wrote: >> On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote: >>> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen wrote: >>> > On Mon, Feb 26, 2018 at 07:27:05AM +0000, Sargun Dhillon wrote: >>> >> +config SECCOMP_FILTER_EXTENDED >>> >> + bool "Extended BPF seccomp filters" >>> >> + depends on SECCOMP_FILTER && BPF_SYSCALL >>> >> + depends on !CHECKPOINT_RESTORE >>> > >>> > Why not just give -EINVAL or something in case one of these is >>> > requested, instead of making them incompatible at compile time? >>> > >>> > Tycho >>> There's already code to return -EMEDIUMTYPE if it's a non-classic, or >>> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE >>> enabled, you should never be able to get that. I think it makes sense >>> to preserve this behaviour. >> >> Oh, right. So can't we just drop this, and the existing code will >> DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't >> supported, until they are? >> >> Tycho > My suggestion is we merge this as is, so we don't break checkpoint / > restore, and I will try to get the filter dumping patching in the same > development cycle as it comes at minimal risk. Otherwise, we risk > introducing a feature which could break checkpoint/restore, even in > unprivileged containers since anyone can load a BPF Seccomp filter. There is no rush to merge such a drastic expansion of the seccomp attack surface. :) For me, the driving feature is if we can get Tycho's notifier implemented in eBPF. 
The speed improvements, as far as I'm concerned, aren't sufficient to add eBPF to seccomp. They are certainly a nice benefit, but seccomp must be very conservative about adding attack surface. -Kees -- Kees Cook Pixel Security From keescook at chromium.org Tue Feb 27 04:38:18 2018 From: keescook at chromium.org (Kees Cook) Date: Mon, 26 Feb 2018 20:38:18 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski wrote: >> On Feb 26, 2018, at 3:20 PM, Kees Cook wrote: >> >> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov >> wrote: >>>> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>>> This patchset enables seccomp filters to be written in eBPF. Although, this >>>> [...] >>> The main statement I want to hear from seccomp maintainers before >>> proceeding any further on this that enabling eBPF in seccomp won't lead >>> to seccomp folks arguing against changes in bpf core (like verifier) >>> just because it's used by seccomp. >>> It must be spelled out in the commit log with explicit Ack. >> >> The primary thing I'm concerned about with eBPF and seccomp is >> side-effects from eBPF programs running at syscall time. This is an >> extremely sensitive area, and I want to be sure there won't be >> feature-creep here that leads to seccomp getting into a bad state. >> >> As long as seccomp can continue have its own verifier, I *think* this >> will be fine, though, again I remain concerned about maps, etc. I'm >> still reviewing these patches and how they might provide overlap with >> Tycho's needs too, etc. > > I'm not sure I see this as a huge problem. As far as I can see, there > are three ways that a verifier change could be problematic: > > 1. Addition of a new type of map. But seccomp would just not allow > new map types by default, right? > > 2. 
Addition of a new BPF_CALLable helper. Seccomp wants a way to > whitelist BPF_CALL targets. That should be straightforward. Yup, agreed on 1 and 2. > 3. Straight-up bugs. Those are exactly as problematic as verifier > bugs in any other unprivileged eBPF program type, right? I don't see > why seccomp is special here. My concern is more about unintended design mistakes or other feature creep with side-effects, especially when it comes to privileges and synchronization. Getting no-new-privs done correctly, for example, took some careful thought and discussion, and I'm shy from how painful TSYNC was on the process locking side, and eBPF has had some rather ugly flaws in the past (and recently: it was nice to be able to say for Spectre that seccomp filters couldn't be constructed to make attacks but eBPF could). Adding the complexity needs to be worth the gain. I'm on board for doing it, I just want to be careful. :) -Kees -- Kees Cook Pixel Security From luto at amacapital.net Tue Feb 27 04:54:39 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Mon, 26 Feb 2018 20:54:39 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: > On Feb 26, 2018, at 8:38 PM, Kees Cook wrote: > > On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski wrote: >>> On Feb 26, 2018, at 3:20 PM, Kees Cook wrote: >>> >>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov >>> wrote: >>>>> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>>>> This patchset enables seccomp filters to be written in eBPF. Although, this >>>>> [...] >>>> The main statement I want to hear from seccomp maintainers before >>>> proceeding any further on this that enabling eBPF in seccomp won't lead >>>> to seccomp folks arguing against changes in bpf core (like verifier) >>>> just because it's used by seccomp. 
>>>> It must be spelled out in the commit log with explicit Ack. >>> >>> The primary thing I'm concerned about with eBPF and seccomp is >>> side-effects from eBPF programs running at syscall time. This is an >>> extremely sensitive area, and I want to be sure there won't be >>> feature-creep here that leads to seccomp getting into a bad state. >>> >>> As long as seccomp can continue have its own verifier, I *think* this >>> will be fine, though, again I remain concerned about maps, etc. I'm >>> still reviewing these patches and how they might provide overlap with >>> Tycho's needs too, etc. >> >> I'm not sure I see this as a huge problem. As far as I can see, there >> are three ways that a verifier change could be problematic: >> >> 1. Addition of a new type of map. But seccomp would just not allow >> new map types by default, right? >> >> 2. Addition of a new BPF_CALLable helper. Seccomp wants a way to >> whitelist BPF_CALL targets. That should be straightforward. > > Yup, agreed on 1 and 2. > >> 3. Straight-up bugs. Those are exactly as problematic as verifier >> bugs in any other unprivileged eBPF program type, right? I don't see >> why seccomp is special here. > > My concern is more about unintended design mistakes or other feature > creep with side-effects, especially when it comes to privileges and > synchronization. Getting no-new-privs done correctly, for example, > took some careful thought and discussion, and I'm shy from how painful > TSYNC was on the process locking side, and eBPF has had some rather > ugly flaws in the past (and recently: it was nice to be able to say > for Spectre that seccomp filters couldn't be constructed to make > attacks but eBPF could). Adding the complexity needs to be worth the > gain. I'm on board for doing it, I just want to be careful. :) > I agree. I think that, if we do this right, we get a clean version of Tycho's notifiers. 
We can also very easily build on that to send a non-blocking message to the notifier fd, which gets us a version of seccomp logging that works for things like Chromium and even strace. I think this is worth it. I also think this sort of argument is why Mickaël's privileged-first Landlock is the wrong approach. By getting the unprivileged parts right from day one, we can carefully extend the mechanism and keep it usable by unprivileged apps. But, if we'd started as root-only, fixing up everything needed to make it safe for unprivileged users after the fact would have been quite messy. And the considerations for making eBPF safe for use by unprivileged tasks to filter their descendants are more or less the same for seccomp and Landlock. Can we please arrange things so we solve this problem only once? From mszeredi at redhat.com Tue Feb 27 09:00:25 2018 From: mszeredi at redhat.com (Miklos Szeredi) Date: Tue, 27 Feb 2018 10:00:25 +0100 Subject: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic. In-Reply-To: <20180226235302.12708-5-ebiederm@xmission.com> References: <87po4rz4ui.fsf_-_@xmission.com> <20180226235302.12708-5-ebiederm@xmission.com> Message-ID: On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman wrote: > Rename the fuse connection flag posix_acl to cached_posix_acl as that > is what it actually means. That fuse will cache and operate on the > cached value of the posix acl. > > When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode > so that get_acl and friends won't cache the acl values even if they > are called. > > Replace forget_all_cached_acls with fuse_forget_cached_acls. This > wrapper only takes effect when cached_posix_acl is true to prevent > losing the nocache or noxattr status when posix acls are not > cached. Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE? I think it makes sense to generally not clear ACL_DONT_CACHE, since it's not an actual acl value that needs forgetting.
Thanks, Miklos From daniel at iogearbox.net Tue Feb 27 09:28:30 2018 From: daniel at iogearbox.net (Daniel Borkmann) Date: Tue, 27 Feb 2018 10:28:30 +0100 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On 02/27/2018 01:01 AM, Sargun Dhillon wrote: > On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov > wrote: >> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>> This patchset enables seccomp filters to be written in eBPF. Although, this >>> patchset doesn't introduce much of the functionality enabled by eBPF, it lays >>> the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE >>> support in order to utilize eBPF seccomp filters, as eBPF filters cannot be >>> retrieved via the ptrace GET_FILTER API. >> >> this was discussed multiple times in the past. >> In eBPF land it's practically impossible to do checkpoint/restore >> of the whole bpf program/map graph. >> >>> Any user can load a bpf seccomp filter program, and it can be pinned and >>> reused without requiring access to the bpf syscalls. A user only requires >>> the traditional permissions of either being cap_sys_admin, or have >>> no_new_privs set in order to install their rule. >>> >>> The primary reason for not adding maps support in this patchset is >>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS. >>> If we have a map that the BPF program can read, it can potentially >>> "change" privileges after running. It seems like doing writes only >>> is safe, because it can be pure, and side effect free, and therefore >>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come >>> to an agreement, this can be in a follow-up patchset. >> >> readonly maps already exist. See BPF_F_RDONLY. >> Is that not enough? >> > With BPF_F_RDONLY, is there a mechanism to populate a prog_array, and > then mark it rd_only? 
This would still need to be extended for this purpose. Right now this is either set on map creation (e.g. such that only prog itself can update the entries) or obj_get. So you'd need a mechanism that sets flags into rdonly mode where once set it cannot be undone anymore for the remaining lifetime of the map. >>> A benchmark of this patchset is as follows for a very standard eBPF filter: >>> >>> Given this test program: >>> for (i = 10; i < 99999999; i++) syscall(__NR_getpid); >>> >>> If I implement an eBPF filter with PROG_ARRAYs with a program per syscall, >>> and tail call, the numbers are such: >>> ebpf JIT 12.3% slower than native >>> ebpf no JIT 13.6% slower than native >>> seccomp JIT 17.6% slower than native >>> seccomp no JIT 37% slower than native >> >> the perf gains are misleading, since patches don't enable bpf_tail_call. >> >> The main statement I want to hear from seccomp maintainers before >> proceeding any further on this that enabling eBPF in seccomp won't lead >> to seccomp folks arguing against changes in bpf core (like verifier) >> just because it's used by seccomp. >> It must be spelled out in the commit log with explicit Ack. Fully agree. From chris.hyser at oracle.com Tue Feb 27 14:53:43 2018 From: chris.hyser at oracle.com (chris hyser) Date: Tue, 27 Feb 2018 09:53:43 -0500 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On 02/26/2018 11:38 PM, Kees Cook wrote: > On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski wrote: >> 3. Straight-up bugs. Those are exactly as problematic as verifier >> bugs in any other unprivileged eBPF program type, right? I don't see >> why seccomp is special here. > > My concern is more about unintended design mistakes or other feature > creep with side-effects, especially when it comes to privileges and > synchronization. 
Getting no-new-privs done correctly, for example, > took some careful thought and discussion, and I'm shy from how painful > TSYNC was on the process locking side, and eBPF has had some rather > ugly flaws in the past (and recently: it was nice to be able to say > for Spectre that seccomp filters couldn't be constructed to make > attacks but eBPF could). Adding the complexity needs to be worth the > gain. I'm on board for doing it, I just want to be careful. :) Another option might be to remove c/eBPF from the equation all together. c/eBPF allows flexibility and that almost always comes at the cost of additional security risk. Seccomp is for enhanced security yes? How about a new seccomp mode that passes in something like a bit vector or hashmap for "simple" white/black list checks validated by kernel code, versus user provided interpreted code? Of course this removes a fair number of things you can currently do or would be able to do with eBPF. Of course, restated from a security point of view, this removes a fair number of things an _attacker_ can do. Presumably the performance improvement would also be significant. Is this an idea worth prototyping? -chrish From keescook at chromium.org Tue Feb 27 16:00:06 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 27 Feb 2018 08:00:06 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: > On 02/26/2018 11:38 PM, Kees Cook wrote: >> >> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >> wrote: >>> >>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>> bugs in any other unprivileged eBPF program type, right? I don't see >>> why seccomp is special here. 
>> >> >> My concern is more about unintended design mistakes or other feature >> creep with side-effects, especially when it comes to privileges and >> synchronization. Getting no-new-privs done correctly, for example, >> took some careful thought and discussion, and I'm shy from how painful >> TSYNC was on the process locking side, and eBPF has had some rather >> ugly flaws in the past (and recently: it was nice to be able to say >> for Spectre that seccomp filters couldn't be constructed to make >> attacks but eBPF could). Adding the complexity needs to be worth the >> gain. I'm on board for doing it, I just want to be careful. :) > > > > Another option might be to remove c/eBPF from the equation all together. > c/eBPF allows flexibility and that almost always comes at the cost of > additional security risk. Seccomp is for enhanced security yes? How about a > new seccomp mode that passes in something like a bit vector or hashmap for > "simple" white/black list checks validated by kernel code, versus user > provided interpreted code? Of course this removes a fair number of things > you can currently do or would be able to do with eBPF. Of course, restated > from a security point of view, this removes a fair number of things an > _attacker_ can do. Presumably the performance improvement would also be > significant. > > Is this an idea worth prototyping? That was the original prototype for seccomp-filter. :) The discussion around that from years ago basically boiled down to it being inflexible. Given all the things people want to do at syscall time, that continues to be true. So true, in fact, that here we are now, trying to move to eBPF from cBPF. 
;) -Kees -- Kees Cook Pixel Security From chris.hyser at oracle.com Tue Feb 27 16:59:48 2018 From: chris.hyser at oracle.com (chris hyser) Date: Tue, 27 Feb 2018 11:59:48 -0500 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On 02/27/2018 11:00 AM, Kees Cook wrote: > On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: >> On 02/26/2018 11:38 PM, Kees Cook wrote: >>> >>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>> wrote: >>>> >>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>> why seccomp is special here. >>> >>> >>> My concern is more about unintended design mistakes or other feature >>> creep with side-effects, especially when it comes to privileges and >>> synchronization. Getting no-new-privs done correctly, for example, >>> took some careful thought and discussion, and I'm shy from how painful >>> TSYNC was on the process locking side, and eBPF has had some rather >>> ugly flaws in the past (and recently: it was nice to be able to say >>> for Spectre that seccomp filters couldn't be constructed to make >>> attacks but eBPF could). Adding the complexity needs to be worth the >>> gain. I'm on board for doing it, I just want to be careful. :) >> >> >> >> Another option might be to remove c/eBPF from the equation all together. >> c/eBPF allows flexibility and that almost always comes at the cost of >> additional security risk. Seccomp is for enhanced security yes? How about a >> new seccomp mode that passes in something like a bit vector or hashmap for >> "simple" white/black list checks validated by kernel code, versus user >> provided interpreted code? Of course this removes a fair number of things >> you can currently do or would be able to do with eBPF. 
Of course, restated >> from a security point of view, this removes a fair number of things an >> _attacker_ can do. Presumably the performance improvement would also be >> significant. >> >> Is this an idea worth prototyping? > > That was the original prototype for seccomp-filter. :) The discussion > around that from years ago basically boiled down to it being > inflexible. Given all the things people want to do at syscall time, > that continues to be true. So true, in fact, that here we are now, > trying to move to eBPF from cBPF. ;) I will try to find that discussion. As someone pointed out here though, eBPF is being used by more and more people in areas where security is not the primary concern. Differing objectives will make this a long term continuing issue. We ourselves were looking at eBPF simply as a means to use a hashmap for a white/blacklist, i.e. performance not flexibility. -chrish From keescook at chromium.org Tue Feb 27 19:19:12 2018 From: keescook at chromium.org (Kees Cook) Date: Tue, 27 Feb 2018 11:19:12 -0800 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On Tue, Feb 27, 2018 at 8:59 AM, chris hyser wrote: > On 02/27/2018 11:00 AM, Kees Cook wrote: >> >> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser >> wrote: >>> >>> On 02/26/2018 11:38 PM, Kees Cook wrote: >>>> >>>> >>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>>> wrote: >>>>> >>>>> >>>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>>> why seccomp is special here. >>>> >>>> >>>> >>>> My concern is more about unintended design mistakes or other feature >>>> creep with side-effects, especially when it comes to privileges and >>>> synchronization. 
Getting no-new-privs done correctly, for example, >>>> took some careful thought and discussion, and I'm shy from how painful >>>> TSYNC was on the process locking side, and eBPF has had some rather >>>> ugly flaws in the past (and recently: it was nice to be able to say >>>> for Spectre that seccomp filters couldn't be constructed to make >>>> attacks but eBPF could). Adding the complexity needs to be worth the >>>> gain. I'm on board for doing it, I just want to be careful. :) >>> >>> >>> >>> >>> Another option might be to remove c/eBPF from the equation all together. >>> c/eBPF allows flexibility and that almost always comes at the cost of >>> additional security risk. Seccomp is for enhanced security yes? How about >>> a >>> new seccomp mode that passes in something like a bit vector or hashmap >>> for >>> "simple" white/black list checks validated by kernel code, versus user >>> provided interpreted code? Of course this removes a fair number of things >>> you can currently do or would be able to do with eBPF. Of course, >>> restated >>> from a security point of view, this removes a fair number of things an >>> _attacker_ can do. Presumably the performance improvement would also be >>> significant. >>> >>> Is this an idea worth prototyping? >> >> >> That was the original prototype for seccomp-filter. :) The discussion >> around that from years ago basically boiled down to it being >> inflexible. Given all the things people want to do at syscall time, >> that continues to be true. So true, in fact, that here we are now, >> trying to move to eBPF from cBPF. ;) > > > I will try to find that discussion. 
As someone pointed out here though, eBPF A good starting point might be this: https://lwn.net/Articles/441232/ -Kees -- Kees Cook Pixel Security From chris.hyser at oracle.com Tue Feb 27 21:22:45 2018 From: chris.hyser at oracle.com (chris hyser) Date: Tue, 27 Feb 2018 16:22:45 -0500 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On 02/27/2018 02:19 PM, Kees Cook wrote: > On Tue, Feb 27, 2018 at 8:59 AM, chris hyser wrote: >> I will try to find that discussion. As someone pointed out here though, eBPF > > A good starting point might be this: > https://lwn.net/Articles/441232/ Thanks. A fair amount of reading referenced there :-). In particular I'll be curious to find out what happened to this idea: "Essentially, that would make for three choices for each system call: enabled, disabled, or filtered." Something like that might address some of the security concerns in that a simple go/no go on syscall number need not incur the performance hit nor increased attack surface of running c/eBPF code, but it is there for argument checking, etc if you need it. Basically instead of the kernel making the flexibility/performance/security trade-off in advance, you leave it to user code/policy. Anyway, lest it is not clear :-), I think your instincts on security and eBPF are dead on. At the same time it is powerful and useful. So, how to make it optional? 
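The "enabled, disabled, or filtered" idea quoted above can be sketched as a packed per-syscall table that is consulted before any filter program runs. This is only a minimal illustration under stated assumptions: every name here (sc_policy, sc_policy_set, sc_policy_get, SC_MAX_SYSCALLS) is hypothetical, and nothing like it exists in the seccomp API discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout: this is not an existing kernel interface. */
enum sc_action {
    SC_DISABLED = 0, /* deny outright, no filter program is run */
    SC_ENABLED  = 1, /* allow outright, no filter program is run */
    SC_FILTERED = 2  /* hand the call to a c/eBPF program for argument checks */
};

#define SC_MAX_SYSCALLS 512u /* assumed upper bound on syscall numbers */

struct sc_policy {
    uint8_t bits[SC_MAX_SYSCALLS / 4]; /* two bits per syscall number */
};

/* Start from default-deny: every syscall is SC_DISABLED. */
void sc_policy_init(struct sc_policy *p)
{
    memset(p->bits, 0, sizeof(p->bits));
}

/* Record the action for one syscall number. */
void sc_policy_set(struct sc_policy *p, unsigned int nr, enum sc_action a)
{
    assert(nr < SC_MAX_SYSCALLS);
    unsigned int shift = (nr % 4u) * 2u;
    unsigned int cleared = p->bits[nr / 4u] & ~(3u << shift);
    p->bits[nr / 4u] = (uint8_t)(cleared | ((unsigned int)a << shift));
}

/* Constant-time lookup; out-of-range numbers are treated as disabled. */
enum sc_action sc_policy_get(const struct sc_policy *p, unsigned int nr)
{
    if (nr >= SC_MAX_SYSCALLS)
        return SC_DISABLED;
    return (enum sc_action)((p->bits[nr / 4u] >> ((nr % 4u) * 2u)) & 3u);
}
```

In such a scheme only the SC_FILTERED entries would pay the cost of running a filter program, which is the trade-off the message describes leaving to user policy.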
-chrish From daniel at iogearbox.net Tue Feb 27 21:58:11 2018 From: daniel at iogearbox.net (Daniel Borkmann) Date: Tue, 27 Feb 2018 22:58:11 +0100 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: On 02/27/2018 05:59 PM, chris hyser wrote: > On 02/27/2018 11:00 AM, Kees Cook wrote: >> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: >>> On 02/26/2018 11:38 PM, Kees Cook wrote: >>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>>> wrote: >>>>> >>>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>>> why seccomp is special here. >>>> >>>> My concern is more about unintended design mistakes or other feature >>>> creep with side-effects, especially when it comes to privileges and >>>> synchronization. Getting no-new-privs done correctly, for example, >>>> took some careful thought and discussion, and I'm shy from how painful >>>> TSYNC was on the process locking side, and eBPF has had some rather >>>> ugly flaws in the past (and recently: it was nice to be able to say >>>> for Spectre that seccomp filters couldn't be constructed to make >>>> attacks but eBPF could). Adding the complexity needs to be worth the Well, not really. One part of all the Spectre mitigations that went upstream from BPF side was to have an option to remove interpreter entirely and that also relates to seccomp eventually. But other than that an attacker might potentially find as well useful gadgets inside seccomp or any other code that is inside the kernel, so it's not a strict necessity either. >>>> gain. I'm on board for doing it, I just want to be careful. :) >>> >>> Another option might be to remove c/eBPF from the equation all together. >>> c/eBPF allows flexibility and that almost always comes at the cost of >>> additional security risk.
Seccomp is for enhanced security yes? How about a >>> new seccomp mode that passes in something like a bit vector or hashmap for >>> "simple" white/black list checks validated by kernel code, versus user >>> provided interpreted code? Of course this removes a fair number of things >>> you can currently do or would be able to do with eBPF. Of course, restated >>> from a security point of view, this removes a fair number of things an >>> _attacker_ can do. Presumably the performance improvement would also be >>> significant. Good luck with not breaking existing applications relying on seccomp out there. >>> Is this an idea worth prototyping? >> >> That was the original prototype for seccomp-filter. :) The discussion >> around that from years ago basically boiled down to it being >> inflexible. Given all the things people want to do at syscall time, >> that continues to be true. So true, in fact, that here we are now, >> trying to move to eBPF from cBPF. ;) Right, agree. cBPF is also pretty much frozen these days and aside from that, seccomp/BPF also just uses a proper subset of it. I wouldn't mind doing something similar for eBPF side as long as this is reasonably maintainable and not making BPF core more complex, but most of it can already be set in the verifier anyway based on prog type. Note, that performance of seccomp/BPF is definitely a demand as well which is why people still extend the old remaining cBPF JITs today such that it can be JITed also from there. > I will try to find that discussion. As someone pointed out here though, eBPF is being used by more and more people in areas where security is not the primary concern. Differing objectives will make this a long term continuing issue. We ourselves were looking at eBPF simply as a means to use a hashmap for a white/blacklist, i.e. performance not flexibility. 
Not really, security of verifier and BPF infra in general is on the top of the list, it's fundamental to the underlying concept and just because it is heavily used also in tracing and networking, it only shows that the concept is highly flexible that it can be applied in multiple areas. From chris.hyser at oracle.com Tue Feb 27 22:20:14 2018 From: chris.hyser at oracle.com (chris hyser) Date: Tue, 27 Feb 2018 17:20:14 -0500 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: <7fc0fab8-c1bc-bc76-a892-b3faab7d16ad@oracle.com> On 02/27/2018 04:58 PM, Daniel Borkmann wrote: > On 02/27/2018 05:59 PM, chris hyser wrote: >> On 02/27/2018 11:00 AM, Kees Cook wrote: >>> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: >>>> On 02/26/2018 11:38 PM, Kees Cook wrote: >>>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>>>> wrote: >>>>>> >>>>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>>>> why seccomp is special here. >>>>> >>>>> My concern is more about unintended design mistakes or other feature >>>>> creep with side-effects, especially when it comes to privileges and >>>>> synchronization. Getting no-new-privs done correctly, for example, >>>>> took some careful thought and discussion, and I'm shy from how painful >>>>> TSYNC was on the process locking side, and eBPF has had some rather >>>>> ugly flaws in the past (and recently: it was nice to be able to say >>>>> for Spectre that seccomp filters couldn't be constructed to make >>>>> attacks but eBPF could). Adding the complexity needs to be worth the > > Well, not really. One part of all the Spectre mitigations that went upstream > from BPF side was to have an option to remove interpreter entirely and that > also relates to seccomp eventually.
But other than that an attacker might > potentially find as well useful gadgets inside seccomp or any other code > that is inside the kernel, so it's not a strict necessity either. > >>>>> gain. I'm on board for doing it, I just want to be careful. :) >>>> >>>> Another option might be to remove c/eBPF from the equation all together. >>>> c/eBPF allows flexibility and that almost always comes at the cost of >>>> additional security risk. Seccomp is for enhanced security yes? How about a >>>> new seccomp mode that passes in something like a bit vector or hashmap for >>>> "simple" white/black list checks validated by kernel code, versus user >>>> provided interpreted code? Of course this removes a fair number of things >>>> you can currently do or would be able to do with eBPF. Of course, restated >>>> from a security point of view, this removes a fair number of things an >>>> _attacker_ can do. Presumably the performance improvement would also be >>>> significant. > > Good luck with not breaking existing applications relying on seccomp out > there. This wasn't in the context of an implementation proposal, but the assumption would be to add this in addition to the old way. Now, does that make sense to do? That is the discussion. > >>>> Is this an idea worth prototyping? >>> >>> That was the original prototype for seccomp-filter. :) The discussion >>> around that from years ago basically boiled down to it being >>> inflexible. Given all the things people want to do at syscall time, >>> that continues to be true. So true, in fact, that here we are now, >>> trying to move to eBPF from cBPF. ;) > > Right, agree. cBPF is also pretty much frozen these days and aside from > that, seccomp/BPF also just uses a proper subset of it. I wouldn't mind > doing something similar for eBPF side as long as this is reasonably > maintainable and not making BPF core more complex, but most of it can > already be set in the verifier anyway based on prog type. 
Note, that > performance of seccomp/BPF is definitely a demand as well which is why > people still extend the old remaining cBPF JITs today such that it can > be JITed also from there. > >> I will try to find that discussion. As someone pointed out here though, eBPF is being used by more and more people in areas where security is not the primary concern. Differing objectives will make this a long term continuing issue. We ourselves were looking at eBPF simply as a means to use a hashmap for a white/blacklist, i.e. performance not flexibility. > > Not really, security of verifier and BPF infra in general is on the top > of the list, it's fundamental to the underlying concept and just because > it is heavily used also in tracing and networking, it only shows that the > concept is highly flexible that it can be applied in multiple areas. Ok. Let me look into this a bit because this is the heart of the matter. -chrish From mic at digikod.net Tue Feb 27 23:10:12 2018 From: mic at digikod.net (Mickaël Salaün) Date: Wed, 28 Feb 2018 00:10:12 +0100 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> Message-ID: <5323e010-09df-26d9-15f5-c723faa13224@digikod.net> On 27/02/2018 05:54, Andy Lutomirski wrote: > > >> On Feb 26, 2018, at 8:38 PM, Kees Cook wrote: >> >> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski wrote: >>>> On Feb 26, 2018, at 3:20 PM, Kees Cook wrote: >>>> >>>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov >>>> wrote: >>>>>> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>>>>> This patchset enables seccomp filters to be written in eBPF. Although, this >>>>>> [...]
>>>>> The main statement I want to hear from seccomp maintainers before >>>>> proceeding any further on this that enabling eBPF in seccomp won't lead >>>>> to seccomp folks arguing against changes in bpf core (like verifier) >>>>> just because it's used by seccomp. >>>>> It must be spelled out in the commit log with explicit Ack. >>>> >>>> The primary thing I'm concerned about with eBPF and seccomp is >>>> side-effects from eBPF programs running at syscall time. This is an >>>> extremely sensitive area, and I want to be sure there won't be >>>> feature-creep here that leads to seccomp getting into a bad state. >>>> >>>> As long as seccomp can continue have its own verifier, I *think* this >>>> will be fine, though, again I remain concerned about maps, etc. I'm >>>> still reviewing these patches and how they might provide overlap with >>>> Tycho's needs too, etc. >>> >>> I'm not sure I see this as a huge problem. As far as I can see, there >>> are three ways that a verifier change could be problematic: >>> >>> 1. Addition of a new type of map. But seccomp would just not allow >>> new map types by default, right? >>> >>> 2. Addition of a new BPF_CALLable helper. Seccomp wants a way to >>> whitelist BPF_CALL targets. That should be straightforward. >> >> Yup, agreed on 1 and 2. >> >>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>> bugs in any other unprivileged eBPF program type, right? I don't see >>> why seccomp is special here. >> >> My concern is more about unintended design mistakes or other feature >> creep with side-effects, especially when it comes to privileges and >> synchronization. 
Getting no-new-privs done correctly, for example, >> took some careful thought and discussion, and I'm shy from how painful >> TSYNC was on the process locking side, and eBPF has had some rather >> ugly flaws in the past (and recently: it was nice to be able to say >> for Spectre that seccomp filters couldn't be constructed to make >> attacks but eBPF could). Adding the complexity needs to be worth the >> gain. I'm on board for doing it, I just want to be careful. :) >> > > I agree. I think that, if we do this right, we get a clean version of Tycho's notifiers. We can also very easily build on that to send a non-blocking message to the notifier fd, which gets us a version of seccomp logging that works for things like Chromium and even strace. I think this is worth it. > > I also think this sort of argument is why Mickaël's privileged-first Landlock is the wrong approach. By getting the unprivileged parts right from day one, we can carefully extend the mechanism and keep it usable by unprivileged apps. But, if we'd started as root-only, fixing up everything needed to make it safe for unprivileged users after the fact would have been quite messy. We agreed (including Kees and you, at the Santa Fe LPC) to limit the use of Landlock to CAP_SYS_ADMIN at first. It is an artificial limitation that can be re-enabled by removing three explicit checks/lines. Landlock was designed for unprivileged use from day one and it is still the goal. > > And the considerations for making eBPF safe for use by unprivileged tasks to filter their descendants are more or less the same for seccomp and Landlock. Can we please arrange things so we solve this problem only once? > Landlock is definitely focused on eBPF. It should not be hard to add a new Landlock program type to mimic the seccomp filter checks (to use eBPF features like maps), but I'm not sure I get the use case here.
From luto at amacapital.net Tue Feb 27 23:11:37 2018 From: luto at amacapital.net (Andy Lutomirski) Date: Tue, 27 Feb 2018 23:11:37 +0000 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <5323e010-09df-26d9-15f5-c723faa13224@digikod.net> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> <5323e010-09df-26d9-15f5-c723faa13224@digikod.net> Message-ID: On Tue, Feb 27, 2018 at 11:10 PM, Mickaël Salaün wrote: > > On 27/02/2018 05:54, Andy Lutomirski wrote: >> >> >>> On Feb 26, 2018, at 8:38 PM, Kees Cook wrote: >>> >>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski wrote: >>>>> On Feb 26, 2018, at 3:20 PM, Kees Cook wrote: >>>>> >>>>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov >>>>> wrote: >>>>>>> On Mon, Feb 26, 2018 at 07:26:54AM +0000, Sargun Dhillon wrote: >>>>>>> This patchset enables seccomp filters to be written in eBPF. Although, this >>>>>>> [...] >>>>>> The main statement I want to hear from seccomp maintainers before >>>>>> proceeding any further on this that enabling eBPF in seccomp won't lead >>>>>> to seccomp folks arguing against changes in bpf core (like verifier) >>>>>> just because it's used by seccomp. >>>>>> It must be spelled out in the commit log with explicit Ack. >>>>> >>>>> The primary thing I'm concerned about with eBPF and seccomp is >>>>> side-effects from eBPF programs running at syscall time. This is an >>>>> extremely sensitive area, and I want to be sure there won't be >>>>> feature-creep here that leads to seccomp getting into a bad state. >>>>> >>>>> As long as seccomp can continue have its own verifier, I *think* this >>>>> will be fine, though, again I remain concerned about maps, etc. I'm >>>>> still reviewing these patches and how they might provide overlap with >>>>> Tycho's needs too, etc.
>>>> >>>> I'm not sure I see this as a huge problem. As far as I can see, there >>>> are three ways that a verifier change could be problematic: >>>> >>>> 1. Addition of a new type of map. But seccomp would just not allow >>>> new map types by default, right? >>>> >>>> 2. Addition of a new BPF_CALLable helper. Seccomp wants a way to >>>> whitelist BPF_CALL targets. That should be straightforward. >>> >>> Yup, agreed on 1 and 2. >>> >>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>> why seccomp is special here. >>> >>> My concern is more about unintended design mistakes or other feature >>> creep with side-effects, especially when it comes to privileges and >>> synchronization. Getting no-new-privs done correctly, for example, >>> took some careful thought and discussion, and I'm shy from how painful >>> TSYNC was on the process locking side, and eBPF has had some rather >>> ugly flaws in the past (and recently: it was nice to be able to say >>> for Spectre that seccomp filters couldn't be constructed to make >>> attacks but eBPF could). Adding the complexity needs to be worth the >>> gain. I'm on board for doing it, I just want to be careful. :) >>> >> >> I agree. I think that, if we do this right, we get a clean version of Tycho's notifiers. We can also very easily build on that to send a non-blocking message to the notifier fd, which gets us a version of seccomp logging that works for things like Chromium and even strace. I think this is worth it. >> >> I also think this sort of argument is why Mickaël's privileged-first Landlock is the wrong approach. By getting the unprivileged parts right from day one, we can carefully extend the mechanism and keep it usable by unprivileged apps. But, if we'd started as root-only, fixing up everything needed to make it safe for unprivileged users after the fact would have been quite messy.
> > We agreed (including Kees and you, at the Santa Fe LPC) to limit the use > of Landlock to CAP_SYS_ADMIN at first. It is an artificial limitation > that can be re-enabled by removing three explicit checks/lines. Landlock > was designed for unprivileged use from day one and it is still the goal. Indeed. I was obviously too tired to read your email intelligently last night. Sorry. From chris.hyser at oracle.com Tue Feb 27 23:55:04 2018 From: chris.hyser at oracle.com (chris hyser) Date: Tue, 27 Feb 2018 18:55:04 -0500 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <7fc0fab8-c1bc-bc76-a892-b3faab7d16ad@oracle.com> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> <7fc0fab8-c1bc-bc76-a892-b3faab7d16ad@oracle.com> Message-ID: <4fbef77e-92ad-b896-a259-492412ad4c55@oracle.com> > On 02/27/2018 04:58 PM, Daniel Borkmann wrote: >> On 02/27/2018 05:59 PM, chris hyser wrote: >>> On 02/27/2018 11:00 AM, Kees Cook wrote: >>>> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: >>>>> On 02/26/2018 11:38 PM, Kees Cook wrote: >>>>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>>>>> wrote: >>>>>>> >>>>>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>>>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>>>>> why seccomp is special here. >>>>>> >>>>>> My concern is more about unintended design mistakes or other feature >>>>>> creep with side-effects, especially when it comes to privileges and >>>>>> synchronization. Getting no-new-privs done correctly, for example, >>>>>> took some careful thought and discussion, and I'm shy from how painful >>>>>> TSYNC was on the process locking side, and eBPF has had some rather >>>>>> ugly flaws in the past (and recently: it was nice to be able to say >>>>>> for Spectre that seccomp filters couldn't be constructed to make >>>>>> attacks but eBPF could).
Adding the complexity needs to be worth the >> >> Well, not really. One part of all the Spectre mitigations that went upstream >> from BPF side was to have an option to remove interpreter entirely and that >> also relates to seccomp eventually. But other than that an attacker might >> potentially find as well useful gadgets inside seccomp or any other code >> that is inside the kernel, so it's not a strict necessity either. >> >>>>>> gain. I'm on board for doing it, I just want to be careful. :) >>>>> >>>>> Another option might be to remove c/eBPF from the equation altogether. >>>>> c/eBPF allows flexibility and that almost always comes at the cost of >>>>> additional security risk. Seccomp is for enhanced security, yes? How about a >>>>> new seccomp mode that passes in something like a bit vector or hashmap for >>>>> "simple" white/black list checks validated by kernel code, versus user >>>>> provided interpreted code? Of course this removes a fair number of things >>>>> you can currently do or would be able to do with eBPF. Of course, restated >>>>> from a security point of view, this removes a fair number of things an >>>>> _attacker_ can do. Presumably the performance improvement would also be >>>>> significant. >> >> Good luck with not breaking existing applications relying on seccomp out >> there. > > This wasn't in the context of an implementation proposal, but the assumption would be to add this in addition to the old > way. Now, does that make sense to do? That is the discussion. > >> >>>>> Is this an idea worth prototyping? >>>> >>>> That was the original prototype for seccomp-filter. :) The discussion >>>> around that from years ago basically boiled down to it being >>>> inflexible. Given all the things people want to do at syscall time, >>>> that continues to be true. So true, in fact, that here we are now, >>>> trying to move to eBPF from cBPF. ;) >> >> Right, agree.
cBPF is also pretty much frozen these days and aside from >> that, seccomp/BPF also just uses a proper subset of it. I wouldn't mind >> doing something similar for eBPF side as long as this is reasonably >> maintainable and not making BPF core more complex, but most of it can >> already be set in the verifier anyway based on prog type. Note, that >> performance of seccomp/BPF is definitely a demand as well which is why >> people still extend the old remaining cBPF JITs today such that it can >> be JITed also from there. >> >>> I will try to find that discussion. As someone pointed out here though, eBPF is being used by more and more people in >>> areas where security is not the primary concern. Differing objectives will make this a long term continuing issue. We >>> ourselves were looking at eBPF simply as a means to use a hashmap for a white/blacklist, i.e. performance not >>> flexibility. >> >> Not really, security of verifier and BPF infra in general is on the top >> of the list, it's fundamental to the underlying concept and just because >> it is heavily used also in tracing and networking, it only shows that the >> concept is highly flexible that it can be applied in multiple areas. If you're implying that because seccomp would have its own verifier and could therefore restrict itself to a subset of eBPF, therefore any future additions/features to eBPF would not necessarily make seccomp less secure, I mainly agree. Is that the argument?
-chrish From daniel at iogearbox.net Wed Feb 28 19:56:45 2018 From: daniel at iogearbox.net (Daniel Borkmann) Date: Wed, 28 Feb 2018 20:56:45 +0100 Subject: [net-next v3 0/2] eBPF seccomp filters In-Reply-To: <4fbef77e-92ad-b896-a259-492412ad4c55@oracle.com> References: <20180226072651.GA27045@ircssh-2.c.rugged-nimbus-611.internal> <20180226230418.46nczgkh5csakyu7@ast-mbp> <7fc0fab8-c1bc-bc76-a892-b3faab7d16ad@oracle.com> <4fbef77e-92ad-b896-a259-492412ad4c55@oracle.com> Message-ID: <19cd2e07-5702-1713-6903-e5667250b09d@iogearbox.net> On 02/28/2018 12:55 AM, chris hyser wrote: >> On 02/27/2018 04:58 PM, Daniel Borkmann wrote: >> On 02/27/2018 05:59 PM, chris hyser wrote: >>>> On 02/27/2018 11:00 AM, Kees Cook wrote: >>>>> On Tue, Feb 27, 2018 at 6:53 AM, chris hyser wrote: >>>>>> On 02/26/2018 11:38 PM, Kees Cook wrote: >>>>>>> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski >>>>>>> wrote: >>>>>>>> >>>>>>>> 3. Straight-up bugs. Those are exactly as problematic as verifier >>>>>>>> bugs in any other unprivileged eBPF program type, right? I don't see >>>>>>>> why seccomp is special here. >>>>>>> >>>>>>> My concern is more about unintended design mistakes or other feature >>>>>>> creep with side-effects, especially when it comes to privileges and >>>>>>> synchronization. Getting no-new-privs done correctly, for example, >>>>>>> took some careful thought and discussion, and I'm shy from how painful >>>>>>> TSYNC was on the process locking side, and eBPF has had some rather >>>>>>> ugly flaws in the past (and recently: it was nice to be able to say >>>>>>> for Spectre that seccomp filters couldn't be constructed to make >>>>>>> attacks but eBPF could). Adding the complexity needs to be worth the >>> >>> Well, not really. One part of all the Spectre mitigations that went upstream >>> from BPF side was to have an option to remove interpreter entirely and that >>> also relates to seccomp eventually.
But other than that an attacker might >>> potentially find as well useful gadgets inside seccomp or any other code >>> that is inside the kernel, so it's not a strict necessity either. >>> >>>>>>> gain. I'm on board for doing it, I just want to be careful. :) >>>>>> >>>>>> Another option might be to remove c/eBPF from the equation altogether. >>>>>> c/eBPF allows flexibility and that almost always comes at the cost of >>>>>> additional security risk. Seccomp is for enhanced security, yes? How about a >>>>>> new seccomp mode that passes in something like a bit vector or hashmap for >>>>>> "simple" white/black list checks validated by kernel code, versus user >>>>>> provided interpreted code? Of course this removes a fair number of things >>>>>> you can currently do or would be able to do with eBPF. Of course, restated >>>>>> from a security point of view, this removes a fair number of things an >>>>>> _attacker_ can do. Presumably the performance improvement would also be >>>>>> significant. >>> >>> Good luck with not breaking existing applications relying on seccomp out >>> there. >> >> This wasn't in the context of an implementation proposal, but the assumption would be to add this in addition to the old way. Now, does that make sense to do? That is the discussion. I see; didn't read that out from the above when you also mentioned removing cBPF, but fair enough. >>>>>> Is this an idea worth prototyping? >>>>> >>>>> That was the original prototype for seccomp-filter. :) The discussion >>>>> around that from years ago basically boiled down to it being >>>>> inflexible. Given all the things people want to do at syscall time, >>>>> that continues to be true. So true, in fact, that here we are now, >>>>> trying to move to eBPF from cBPF. ;) >>> >>> Right, agree. cBPF is also pretty much frozen these days and aside from >>> that, seccomp/BPF also just uses a proper subset of it.
I wouldn't mind >>> doing something similar for eBPF side as long as this is reasonably >>> maintainable and not making BPF core more complex, but most of it can >>> already be set in the verifier anyway based on prog type. Note, that >>> performance of seccomp/BPF is definitely a demand as well which is why >>> people still extend the old remaining cBPF JITs today such that it can >>> be JITed also from there. >>> >>>> I will try to find that discussion. As someone pointed out here though, eBPF is being used by more and more people in areas where security is not the primary concern. Differing objectives will make this a long term continuing issue. We ourselves were looking at eBPF simply as a means to use a hashmap for a white/blacklist, i.e. performance not flexibility. >>> >>> Not really, security of verifier and BPF infra in general is on the top >>> of the list, it's fundamental to the underlying concept and just because >>> it is heavily used also in tracing and networking, it only shows that the >>> concept is highly flexible that it can be applied in multiple areas. > > If you're implying that because seccomp would have its own verifier and could therefore restrict itself to a subset of eBPF, therefore any future additions/features to eBPF would not necessarily make seccomp less secure, I mainly agree. Is that the argument? Ok, in addition to the current unpriv restrictions imposed by the verifier, what additional requirements would you have from your side in order to get to semantics that make sense for you wrt seccomp/eBPF? Just trying to understand how far we are away from that. Note that not every new feature, map or helper is enabled for every program type of course. Thanks, Daniel > -chrish
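For anyone following the thread, the "bit vector" seccomp mode suggested above (simple allow/deny checks validated by kernel code rather than a user-supplied filter program) could look roughly like the userspace sketch below. This is only an illustration under stated assumptions: the names syscall_allowlist, allowlist_init, allowlist_allow and allowlist_permits, and the 512-entry bound, are hypothetical and do not correspond to any existing kernel interface or to code anyone posted here.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on syscall numbers, chosen only for this sketch. */
#define SKETCH_MAX_SYSCALLS 512
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_MAX_SYSCALLS + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

/* One bit per syscall number: 1 means allowed, 0 means denied. */
struct syscall_allowlist {
	unsigned long bits[SKETCH_WORDS];
};

/* Start from deny-everything. */
static void allowlist_init(struct syscall_allowlist *al)
{
	memset(al, 0, sizeof(*al));
}

/* Mark syscall nr as allowed; returns false if nr is out of range. */
static bool allowlist_allow(struct syscall_allowlist *al, long nr)
{
	if (nr < 0 || nr >= SKETCH_MAX_SYSCALLS)
		return false;
	size_t idx = (size_t)nr;
	al->bits[idx / SKETCH_BITS_PER_WORD] |= 1UL << (idx % SKETCH_BITS_PER_WORD);
	return true;
}

/* The per-syscall check: a bounds test plus one bit lookup, no interpreter. */
static bool allowlist_permits(const struct syscall_allowlist *al, long nr)
{
	if (nr < 0 || nr >= SKETCH_MAX_SYSCALLS)
		return false;
	size_t idx = (size_t)nr;
	return (al->bits[idx / SKETCH_BITS_PER_WORD] >> (idx % SKETCH_BITS_PER_WORD)) & 1UL;
}
```

The check is constant-time and has no user-controlled control flow, which is the security and performance argument made for it above. It also shows the inflexibility Kees describes: it can only decide on the syscall number and has no way to look at arguments.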