RFC: netfilter: nf_conntrack: add support for "conntrack zones"

Patrick McHardy kaber at trash.net
Thu Jan 14 06:05:32 PST 2010


The attached largish patch adds support for "conntrack zones",
which are virtual conntrack tables that can be used to separate
connections from different zones, making it possible to handle
multiple connections with equal identities in conntrack and NAT.

A zone is simply a numerical identifier associated with a network
device that is incorporated into the various hashes and used to
distinguish entries in addition to the connection tuples. Additionally,
it is used to separate conntrack defragmentation queues. An iptables
target for the raw table could be used as an alternative to the network
device for assigning conntrack entries to zones.
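
To illustrate the idea, here is a minimal user-space sketch of a table
keyed on (zone, tuple). It is not code from the patch: struct tuple,
struct entry, zone_hash(), zone_insert() and zone_lookup() are made-up
names, and the hash is a toy FNV-style mix rather than what conntrack
really uses. It only shows the zone being folded into the hash and
compared in addition to the tuple.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64 /* power of two, so the hash can be masked */

/* Hypothetical, simplified connection tuple. */
struct tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Hypothetical table entry: the zone is stored next to the tuple. */
struct entry {
    uint16_t      zone;
    struct tuple  tuple;
    struct entry *next;
};

static struct entry *table[TABLE_SIZE];

/* Toy FNV-style mixing step (an assumption, not the kernel's hash). */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x01000193u;
    return h;
}

/* The zone is hashed together with the tuple fields. */
uint32_t zone_hash(uint16_t zone, const struct tuple *t)
{
    uint32_t h = 0x811c9dc5u;

    h = mix(h, zone);
    h = mix(h, t->src_ip);
    h = mix(h, t->dst_ip);
    h = mix(h, ((uint32_t)t->src_port << 16) | t->dst_port);
    h = mix(h, t->proto);
    return h & (TABLE_SIZE - 1);
}

static bool tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Link an entry at the head of its bucket's chain. */
void zone_insert(struct entry *e)
{
    uint32_t b = zone_hash(e->zone, &e->tuple);

    e->next = table[b];
    table[b] = e;
}

/* An entry matches only if both the zone and the tuple are equal. */
struct entry *zone_lookup(uint16_t zone, const struct tuple *t)
{
    struct entry *e;

    for (e = table[zone_hash(zone, t)]; e != NULL; e = e->next)
        if (e->zone == zone && tuple_equal(&e->tuple, t))
            return e;
    return NULL;
}
```

With this layout, two entries carrying the same tuple but zones 1 and 2
can coexist, and a lookup in zone 0 finds neither of them.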

This is mainly useful when connecting multiple private networks that
use the same addresses (which unfortunately happens occasionally): the
packets are passed through a set of veth devices and each network is
SNATed to a unique address, after which they can pass through the
"main" zone and be handled like regular non-clashing packets and/or
have NAT applied a second time based e.g. on the outgoing interface.

Something like this, with multiple tunl and veth devices, each pair
using a unique zone:

  <tunl0 / zone 1>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to unique network
     |
  <veth1 / zone 1>
  <veth0 / zone 0>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to eth0 address
     |
  <eth0>

As probably everyone has noticed, this is quite similar to what you
can do using network namespaces. The main reason for not using
network namespaces is that it's an all-or-nothing approach: you can't
virtualize just connection tracking. Besides the difficulties in
managing different namespaces from e.g. an IKE or PPP daemon running
in the initial namespace, network namespaces have quite a large
overhead, especially when used with a large conntrack table.

I'm not too fond of this partial feature duplication myself, but I
couldn't think of a better way to do this without the downsides of
using namespaces. Having partially shared network namespaces would
be great, but it doesn't seem to fit in the design very well.
I'm open to any better suggestions :)

A couple of notes on the patch:

- it's not entirely finished yet (ctnetlink and xt_connlimit are
  missing); I wanted to have a discussion about the general idea first.

- the patch uses ct_extend to avoid increasing the connection tracking
  entry size when this feature is not used. An older version of this
  patch added the zone identifier to the conntrack tuples instead. That
  greatly simplifies the changes to the code, since the zone doesn't
  have to be passed around (something like 40 lines total), but has the
  downside of increasing the tuple size.

- the overhead should be quite small; it's mainly the extra argument
  passing and an occasional extra comparison. Code size increase with
  all netfilter options enabled on x86_64 is 152 bytes.
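
The extension approach from the second note can be pictured roughly as
follows. This is a simplified user-space sketch, not the patch itself:
struct zone_ext, struct conn, DEFAULT_ZONE and conn_zone() are
hypothetical names, and the single pointer stands in for an extension
area that an entry would already have for other purposes. An entry
without the extension is treated as belonging to zone 0, assumed here to
be the "main" zone.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_ZONE 0 /* assumed id of the "main" zone */

/* Hypothetical optional extension holding the zone id. */
struct zone_ext {
    uint16_t id;
};

/* Hypothetical, heavily reduced connection tracking entry. */
struct conn {
    /* tuples and other per-connection state omitted */
    struct zone_ext *zone; /* NULL when no zone was assigned */
};

/* Return the entry's zone, falling back to the default zone. */
uint16_t conn_zone(const struct conn *ct)
{
    return ct->zone != NULL ? ct->zone->id : DEFAULT_ZONE;
}
```

Since the zone is no longer part of the tuple in such a layout, it has
to be handed to the lookup and insert paths as a separate argument,
which is where the extra argument passing comes from.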

Any comments welcome.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01.diff
Type: text/x-patch
Size: 50283 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/containers/attachments/20100114/71165c0b/attachment-0001.bin 

