[Ksummit-discuss] [TECH TOPIC] Bus IPC

David Herrmann dh.herrmann at gmail.com
Thu Jul 28 22:24:03 UTC 2016


Tom Gundersen and I would like to propose a technical session on
in-kernel IPC systems. For roughly half a year now, we have been
developing (with others) a capability-based [1] IPC system for Linux,
called bus1 [2]. We would like to present bus1, start a discussion on
open problems, and talk about the possible path forward for upstream
inclusion.

While bus1 emerged out of the kdbus project, it is a new, independent
project, designed from scratch. Its main goal is to implement an n-to-n
communication bus on Linux. A lot of inspiration is taken from DBus, as
well as from the most commonly used IPC systems of other OSs and from
related research projects (including Android Binder, OS X/Hurd Mach
IPC, Solaris Doors, Microsoft Midori IPC, seL4, Sandstorm's Cap'n
Proto, ...).

The bus1 IPC system was designed to...

 o be a machine-local IPC system. It is a fast communication channel
   between local threads and processes, independent of the marshaling
   format used.

 o provide secure, reliable capability-based [1] communication. A
   message is always invoked on a capability, and the caller must own
   said capability; otherwise, it cannot perform the operation.

 o efficiently support n-to-n communication. Every peer can communicate
   with every other peer (given the right capabilities), with minimal
   overhead for state-tracking.

 o be well-suited for both unicast and multicast messages.

 o guarantee a global message order [3], allowing clients to rely on
   causal ordering between messages they send and receive (for further
   reading, see Leslie Lamport's work on distributed systems [4]; a
   small illustrative sketch follows this list).

 o scale with the number of CPUs available. There is no global context
   in bus1 IPC; all communication happens based on local context only.
   That is, if two independent peers never talk to each other, their
   operations never share any memory (no shared locks, no shared
   state, etc.).

 o avoid any in-kernel buffering and instead transfer data directly
   from a sender into the receiver's mappable queue (single-copy).
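
To make the ordering guarantee a bit more concrete: the global order
described in [3] behaves much like the logical clocks from Lamport's
paper [4]. Below is a small, illustrative C sketch of such a clock
(this is not bus1 code; all names are made up for the example). Every
peer keeps a logical timestamp, stamps messages on send, and
fast-forwards its own clock on receive, which yields the causal
ordering mentioned above; bus1's global order is stronger still, as it
is a single order across all peers [3].

/* Illustrative Lamport-clock sketch -- not bus1 source code. */
#include <stdint.h>

struct peer {
        uint64_t clock;         /* local logical clock */
};

struct message {
        uint64_t timestamp;     /* stamped by the sender */
};

/* Called before a message leaves the sending peer. */
static void msg_stamp(struct peer *sender, struct message *m)
{
        m->timestamp = ++sender->clock;
}

/* Called when the receiving peer dequeues a message.  Fast-forwarding
 * the receiver's clock guarantees that everything the receiver sends
 * afterwards carries a larger timestamp than the message it just
 * consumed, i.e., effects are ordered after their causes. */
static void msg_deliver(struct peer *receiver, const struct message *m)
{
        if (m->timestamp > receiver->clock)
                receiver->clock = m->timestamp;
        receiver->clock++;
}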

A user-space implementation of bus1 (or even any bus-based IPC) was
considered, but was found to have several seemingly unavoidable issues.

 o To guarantee reliable, global message ordering, including multicasts,
   as well as to provide reliable capabilities, a bus-broker is
   required. In other words, the current Linux syscall API is not
   sufficient to implement the design described above in an efficient
   way without a dedicated, trusted, privileged process that manages
   the bus and routes messages between the peers.

 o Whenever a bus-broker is involved, any message transaction between
   two clients requires the broker process to execute code in its own
   time-slice. While this time-slice can be distributed fairly across
   clients, it is ultimately always accounted to the user running the
   broker, rather than to the originating user. Kernel time-slice
   accounting and the accounting in the broker are completely separate
   and cannot make decisions based on each other's data.
   Furthermore, the broker needs to run with quite excessive resource
   limits and execution rights to be able to serve requests of
   high-priority peers, making the same resources available to
   low-priority peers as well.
   An in-kernel IPC mechanism removes the requirement for such a
   highly privileged bus-broker and instead accounts every operation
   and resource exactly to the calling user, cgroup, and process.

 o Bus IPC often involves peers requesting services from other trusted
   peers, and waiting for a possible result before continuing. Given
   such a trust relationship, privileged processes actively want
   priority inheritance when calling into less privileged, but trusted,
   processes. There is currently no known way to implement this in a
   user-space broker without requiring n^2 PI-futex pairs (see the
   PI-mutex sketch after this list).

 o A user-space broker would entail two UDS transactions and
   potentially an extra context switch, compared to a single bus1
   transaction with the in-kernel broker. Our x86 benchmarks (before
   any serious optimization work has started) show that two UDS
   transactions are always slower than one bus1 transaction. On top of
   that comes the extra context switch, which has about the same cost
   as a full bus1 transaction, as well as any time spent in the broker
   itself. Even with an imaginary no-overhead broker, we found the
   in-kernel broker to be >40% faster. The numbers will differ between
   machines, but the reduced latency is undeniable (a minimal UDS
   round-trip sketch follows this list).

 o Accounting of in-flight resources (e.g., file descriptors) in a
   broker is completely broken. Right now, any outgoing message of a
   broker will account its FDs to the broker; however, there is no way
   for the broker to track outgoing FDs. As such, it cannot attribute
   them to the original sender of the FD, which opens it up to DoS
   attacks (see the SCM_RIGHTS sketch after this list).

 o LSMs and audit cannot hook into the broker, nor get any additional
   routing information. Thus, audit cannot log proper information, and
   LSMs would have to rely on a user-space process to implement the
   desired security model.

 o The kernel itself can never operate on the bus, nor provide services
   seamlessly to user-space (e.g., like netlink does), unless the bus
   itself is implemented in the kernel.

 o If a broker is involved, no communication can be ordered against
   side-channels. A kernel implementation, on the other hand, provides
   strong ordering against any other event happening on the system.
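
To illustrate what a "PI-futex pair" means in practice: below is a
minimal, illustrative sketch (not broker code) of a process-shared,
priority-inheriting lock built from the standard pthread mutex
attributes, which glibc implements on top of PI futexes. A user-space
broker would need one such shared lock per privileged/unprivileged
peer pair to get priority inheritance, hence the n^2 pairs mentioned
above. Compile with -pthread; real-time priorities are omitted for
brevity.

/* Sketch: process-shared, priority-inheriting mutex. */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_mutex_t *lock;

        /* Place the mutex in memory shared across fork(). */
        lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (lock == MAP_FAILED)
                return 1;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(lock, &attr);

        if (fork() == 0) {
                /* Would-be low-priority service: while it holds the
                 * lock and a higher-priority task blocks on it, the
                 * kernel boosts its priority via the PI futex. */
                pthread_mutex_lock(lock);
                usleep(1000);
                pthread_mutex_unlock(lock);
                _exit(0);
        }

        /* Would-be high-priority caller contending on the same lock. */
        pthread_mutex_lock(lock);
        puts("acquired the shared PI lock");
        pthread_mutex_unlock(lock);

        wait(NULL);
        return 0;
}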
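
For reference, a minimal sketch of the kind of round-trip measurement
behind the UDS numbers above (this is not our actual benchmark, and
the exact numbers depend on the machine). One round-trip over the
socketpair below is two one-way UDS transfers, which is exactly what a
broker needs to move one message from a sender to a receiver:

/* Sketch: measure UDS ping-pong latency over a socketpair. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

int main(void)
{
        int sv[2];
        char buf[64] = "ping";
        struct timespec start, end;

        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
                return 1;

        if (fork() == 0) {
                /* Echo peer: bounce every message straight back. */
                close(sv[0]);
                for (;;) {
                        ssize_t n = recv(sv[1], buf, sizeof(buf), 0);
                        if (n <= 0)
                                _exit(0);
                        send(sv[1], buf, n, 0);
                }
        }
        close(sv[1]);

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERATIONS; ++i) {
                send(sv[0], buf, sizeof(buf), 0);
                recv(sv[0], buf, sizeof(buf), 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                    (end.tv_nsec - start.tv_nsec);
        printf("%.0f ns per round-trip (two UDS transfers)\n",
               ns / ITERATIONS);

        close(sv[0]);
        wait(NULL);
        return 0;
}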
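
To make the FD-accounting problem concrete: when a broker relays a
file descriptor it received from client A on to client B, it must do
so with its own sendmsg()/SCM_RIGHTS call, so the in-flight FD is
charged to the broker rather than to A. A minimal sketch of that
forwarding step (assuming the broker already holds the descriptor in
'fd' and B's UNIX socket in 'to'; both names are made up here):

/* Sketch: broker forwards a file descriptor via SCM_RIGHTS.  The
 * in-flight FD is accounted to the process calling sendmsg() -- the
 * broker -- not to the client that originally handed it over. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int forward_fd(int to, int fd)
{
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(u.buf, 0, sizeof(u.buf));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(to, &msg, 0) < 0 ? -1 : 0;
}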

The implementation of bus1.ko, at <5k LOC, is relatively small, but
still takes a considerable amount of time to review and understand. We
would like to use the kernel-summit as an opportunity to present bus1,
and answer questions on its design, implementation, and use of other
kernel subsystems. We encourage everyone to look into the sources, but
we still believe that a personal discussion up-front would save everyone
a lot of time and energy. Furthermore, it would also allow us to
collectively solve remaining issues.

Everyone interested in IPC is invited to the discussion. In particular,
we would welcome everyone who participated in the Binder and kdbus
discussions, or who is involved in shmem+memcg (or other bus1-related
subsystems), possibly including:

 o Andy Lutomirski
 o Greg Kroah-Hartman
 o Steven Rostedt
 o Eric W. Biederman
 o Jiri Kosina
 o Borislav Petkov
 o Michal Hocko (memcg)
 o Johannes Weiner (memcg)
 o Hugh Dickins (shmem)
 o Tom Gundersen (bus1)
 o David Herrmann (bus1)

Thanks!
    Tom, David

[1] https://en.wikipedia.org/wiki/Capability-based_security
[2] http://www.bus1.org
[3] https://github.com/bus1/bus1/wiki/Message-ordering
[4] http://amturing.acm.org/p558-lamport.pdf

