[PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

Richard Guy Briggs rgb at redhat.com
Wed Aug 27 15:17:01 UTC 2014


On 14/08/25, Andy Lutomirski wrote:
> On Mon, Aug 25, 2014 at 9:41 AM, Nicolas Dichtel
> <nicolas.dichtel at 6wind.com> wrote:
> > Le 25/08/2014 18:13, Andy Lutomirski a écrit :
> >
> >> On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
> >> <nicolas.dichtel at 6wind.com> wrote:
> >>>
> >>> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
> >>>
> >>>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel at 6wind.com>
> >>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> CRIU wants to save the complete state of a namespace and then restore
> >>>>>> it.  For that to work, any information exposed to things in the
> >>>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
> >>>>>> needs to arrange for that information to match whatever it was when
> >>>>>> CRIU saved it.
> >>>>>
> >>>>>
> >>>>>
> >>>>> How are the ifindexes of network devices managed? These ifindexes are
> >>>>> unique per boot, and thus can change depending on the order in which
> >>>>> netdevs are created, yet they are exposed to userspace ...
> >>>>>
> >>>>
> >>>> This does not appear to be true.
> >>>>
> >>>> $ sudo unshare --net
> >>>> # ip link add veth0 type veth peer name veth1
> >>>> # ip link
> >>>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
> >>>> default
> >>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> >>>> DEFAULT group default qlen 1000
> >>>>       link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
> >>>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> >>>> DEFAULT group default qlen 1000
> >>>>       link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
> >>>> # logout
> >>>> $ ip link
> >>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> >>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
> >>>> state DOWN qlen 1000
> >>>>
> >>> I've probably misunderstood what you're trying to say. ifindexes are
> >>> unique per boot and per netns.
> >>
> >>
> >> I think we both misunderstood each other.  The ifindexes are unique
> >> *per netns*, which means that, if you're unprivileged in a netns,
> >> global information doesn't leak to you.  I think this is good.
> >
> > Ok, I agree. I think audit daemons always run as privileged users.
> >
> >
> >>
> >>>>
> >>>> Let me try again, with emphasis in the right place.
> >>>>
> >>>> I think that *code running in a namespace* has no business even
> >>>> knowing a unique identity of *that namespace* from the perspective of
> >>>> the host.
> >>>>
> >>>> In your example, if there's a veth device between netns A and netns B,
> >>>> then code *in netns A* has no business knowing the identity of its
> >>>> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
> >>>> no business knowing the identity of its own netns (A) other than as
> >>>> "my netns".
> >>>
> >>>
> >>> I do not agree (see the example below).
> >>>
> >>>
> >>>>
> >>>> If A and B are siblings, then their parent needs to know where that
> >>>> veth device goes, but I think this is already the case to a sufficient
> >>>> extent today.
> >>>
> >>>
> >>> I'm not aware of a hierarchy between netns. A daemon should be able to
> >>> get the full network configuration, even if it's started after this
> >>> configuration is already applied, i.e. even if it doesn't know what
> >>> happened before it started.
> >>>
> >>
> >> I don't know exactly which namespaces have an explicit hierarchy, but
> >> there is certainly a hierarchy of *user* namespaces, and network
> >> namespaces live in user namespaces, so they at least have somewhat of
> >> a hierarchy.
> >>
> >>>
> >>>>
> >>>> I feel like this discussion is falling into a common trap of new API
> >>>> discussions.  Can one of you who wants this API please articulate,
> >>>> with a reasonably precise example, what it is that you want to do, why
> >>>> you can't easily do it already, and how this API helps?  I currently
> >>>> understand how the API creates problems, but I don't understand how it
> >>>> solves any problems, and I will NAK it (and I suspect that Eric will,
> >>>> too, which is pretty much fatal) unless that changes.
> >>>
> >>>
> >>> What I'm trying to solve is to have full info in the netlink messages
> >>> sent by the kernel, and thus be able to identify a peer netns (which is
> >>> close to what the audit guys are trying to have). Theoretically,
> >>> messages sent by the kernel can be reused as-is to recreate the same
> >>> configuration. This is not the case with x-netns devices. Here is an
> >>> example, with ip tunnels:
> >>>
> >>> $ ip netns add 1
> >>> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev
> >>> eth0
> >>> $ ip -d link ls ipip1
> >>> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
> >>> DEFAULT group default
> >>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
> >>>      ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit
> >>> pmtudisc
> >>> $ ip link set ipip1 netns 1
> >>> $ ip netns exec 1 ip -d link ls ipip1
> >>> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> >>> mode DEFAULT group default
> >>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
> >>>      ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit
> >>> pmtudisc
> >>>
> >>> Now the information reported by 'ip link' is wrong and incomplete:
> >>>   - the link dev is now tunl0 instead of eth0, because we only got an
> >>>     ifindex from the kernel, without any netns information.
> >>>   - the encapsulation addresses are not part of this netns, but the
> >>>     user doesn't know that (again because the netns info is missing).
> >>>     These IPv4 addresses may exist in this netns.
> >>>   - it's not possible to create the same netdevice from this
> >>>     information.
> >>>
> >>
> >> Aha.  That's a genuine problem.
> >>
> >> Perhaps we need a concept of which netnses should be able to see each
> >> other.
> >
> > Yes, I agree. This is not required for all netns, only a subset of netns
> > should be able to see each other.
> >
> >>
> >> I think I would be okay with a somewhat different outcome from your
> >> example:
> >>
> >> $ ip netns exec 1 ip -d link ls ipip1
> >> 8: ipip1@[unknown device in another namespace]:
> >> <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> >>
> >> I think this outcome is mandatory if netns 1 lives in a subsidiary
> >> user namespace.
> >
> > Yes.
> >
> >
> >>
> >> Certainly, if you do the 'ip link' in the original namespace, I agree
> >> that this should work.
> >
> > And yes :)
> >
> > I will update my previous proposal
> > (http://thread.gmane.org/gmane.linux.network/315933/focus=321753)
> > to allow getting an id for a peer netns only when the user namespace is
> > the same.
> 
> I think it should work if the peer userns is the same or a descendant.
> I also wonder whether the peer's ifindex should be suppressed if the peer
> userns is not the same or a descendant.
> 
> Now you just have to get Eric to be happy with the id allocation. :)
> This may be nontrivial.
> 
> >> For most namespace types, this all works transparently, since
> >> everything has a real identity all the way up the hierarchy.  Network
> >> namespaces are different.
> >>
> >> I don't think that exposing serial numbers in /proc is a good
> >> solution, both for the reasons already described and because I don't
> >> think that iproute2 should need to muck around with /proc to function
> >
> > A netlink API is probably enough. But it will help only for the network
> > problem, not for audit. I was hoping to find a common solution.
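
To make the "netlink API" option concrete: below is a rough, untested
sketch of what a userspace consumer of such a peer-netns id could look
like.  The IFLA_LINK_NETNSID_HYP attribute is made up purely for
illustration (it is not in linux/if_link.h); today's RTM_NEWLINK only
carries IFLA_LINK, which is why the ipip example above resolves the
lower device to the wrong netdev.

/*
 * Hypothetical sketch: dump links over NETLINK_ROUTE and, if the kernel
 * supplied a peer-netns id alongside IFLA_LINK, report it.  Only the
 * standard attributes used here exist today; IFLA_LINK_NETNSID_HYP is
 * invented for illustration.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

#define IFLA_LINK_NETNSID_HYP 100	/* hypothetical attribute type */

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifi;
	} req = {
		.nlh = {
			.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
			.nlmsg_type  = RTM_GETLINK,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		},
		.ifi = { .ifi_family = AF_UNSPEC },
	};
	char buf[32768];

	if (fd < 0 || sendto(fd, &req, req.nlh.nlmsg_len, 0,
			     (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
		return 1;

	for (;;) {
		int len = recv(fd, buf, sizeof(buf), 0);
		if (len <= 0)
			break;
		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
			struct ifinfomsg *ifi;
			struct rtattr *rta;
			int alen, link = -1, nsid = -1;

			if (nlh->nlmsg_type == NLMSG_DONE)
				return 0;
			if (nlh->nlmsg_type != RTM_NEWLINK)
				continue;
			ifi = NLMSG_DATA(nlh);
			rta = IFLA_RTA(ifi);
			alen = IFLA_PAYLOAD(nlh);
			for (; RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen)) {
				if (rta->rta_type == IFLA_LINK)
					link = *(int *)RTA_DATA(rta);
				else if (rta->rta_type == IFLA_LINK_NETNSID_HYP)
					nsid = *(int *)RTA_DATA(rta);
			}
			if (link >= 0 && nsid >= 0)
				printf("ifindex %d: lower link is ifindex %d in peer netns id %d\n",
				       ifi->ifi_index, link, nsid);
			else if (link >= 0)
				printf("ifindex %d: lower link is ifindex %d (peer netns unknown -- today's behaviour)\n",
				       ifi->ifi_index, link);
		}
	}
	return 0;
}

With an attribute like that, iproute2 could print something like
"ipip1@if8 in netns <id>" instead of silently resolving the ifindex in
whatever netns the query happens to run in.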
> 
> I still don't understand why audit needs anything beyond the audit
> part of this patch set.  I have no problem with audit seeing that
> migrated/restored namespaces are really brand-new namespaces, as long
> as the code in those namespaces isn't exposed to it.

Ok, I'm starting to get this...  Perhaps /proc wasn't the best place to
expose this.  Audit or an audit aggregator is the only one that needs to
know any of this information.  This could be accomplished with
CAP_AUDIT_CONTROL and a new netlink audit message type to fetch
individual or all namespace IDs for a particular PID via auditctl, or by
having a CAP_AUDIT_WRITE-capable application pull the trigger to simply
dump that information to the log.
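
To make that query concrete, here is a rough, untested sketch of what
the auditctl-style fetch could look like from userspace.  The
AUDIT_GET_NSIDS_HYP message type and its one-word payload are invented
for illustration; they are not in linux/audit.h, and the real request
and reply layout would be defined by whichever patch implements this.

/*
 * Hypothetical sketch: a CAP_AUDIT_CONTROL process asks the kernel, over
 * NETLINK_AUDIT, for the namespace IDs of a given pid.  The message type
 * and payload below do not exist today.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/audit.h>

#define AUDIT_GET_NSIDS_HYP 1020	/* hypothetical message type */

int main(int argc, char **argv)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	struct {
		struct nlmsghdr nlh;
		__u32 pid;		/* whose namespaces we are asking about */
	} req = {
		.nlh = {
			.nlmsg_len   = NLMSG_LENGTH(sizeof(__u32)),
			.nlmsg_type  = AUDIT_GET_NSIDS_HYP,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
		},
		.pid = argc > 1 ? (__u32)atoi(argv[1]) : (__u32)getpid(),
	};

	if (fd < 0)
		return 1;
	/* The kernel side would check CAP_AUDIT_CONTROL before answering. */
	if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
		   (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
		perror("sendto");
		return 1;
	}
	/* A real client would now recv() the reply: one record per namespace
	 * (net, uts, ipc, mnt, pid, user) carrying its serial number, which
	 * auditd or an aggregator could log and match against audit events. */
	close(fd);
	return 0;
}

The CAP_AUDIT_WRITE variant would be even simpler from userspace: the
application just sends the trigger and the kernel emits the namespace
records straight into the audit log.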

> >> correctly.  Eric, any clever ideas here?  Do we need fancier netlink
> >> messages for this?
> >>
> >> --Andy
> 
> Andy Lutomirski

- RGB

--
Richard Guy Briggs <rbriggs at redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

