[Ksummit-2013-discuss] NUMA locality for storage

Christoph Lameter cl at gentwo.org
Tue Jul 30 17:03:21 UTC 2013


On Tue, 30 Jul 2013, Matthew Wilcox wrote:

> I've seen the presentation you cite to support RDMA-for-storage.  It's
> unconvincing.  I've investigated extending NVM Express so that the queues
> are usable from userspace, and it turns out to be a terrible idea.
>  Everything you gain from "bypassing the kernel", you lose once you realise
> that the kernel was "too slow" because it was doing useful work on your
> behalf.  That you now have to replicate, badly, in userspace.  I was
> contacted by an engineer who did a similar project at a different employer;
> they had an even worse experience ... they succeeded in creating hardware!
>  And by the time they were done, they realised it would have been far
> cheaper and quicker to write a decent device driver.  Now they have to get
> rid of all the changes to their applications in order to do a second
> generation of their device ... because there's no way they're bringing
> forward all of the RDMA-style gunk.

Yes, we are talking about direct access to hardware. Can we get to a point
where kernel activities (such as maintenance and control) are performed
on processors that are not in the performance-critical path?
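
For instance, steering a device's interrupts onto dedicated housekeeping
cores already moves some of that kernel work off the fast path. A rough
sketch (untested; the IRQ number and CPU list are made up, and on a real
system you would look the IRQ up in /proc/interrupts first):

/* Steer a device's interrupts to "housekeeping" CPUs so that the
 * performance-critical cores are left alone.  Needs root. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/42/smp_affinity_list", "w");

    if (!f) {
        perror("smp_affinity_list");
        return EXIT_FAILURE;
    }
    /* Allow IRQ 42 only on CPUs 0-3, keeping the remaining
     * cores free for the application's fast path. */
    fprintf(f, "0-3\n");
    fclose(f);
    return EXIT_SUCCESS;
}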

Active processing of data by the kernel is a bad idea for very
high-performance devices, since users expect to get the full bandwidth to
the I/O device, which may be the full speed of the PCIe link (several GB/s
for a PCIe 2.0 or 3.0 x8 device, for example).

Modern hardware has capabilities that can offload most of the data
processing for storage, such as RAID checksums. What the OS needs to do is
control and administer the flow of data. It should not be in the direct
data path.

> Modern network devices don't work the way you think they do, and neither do
> modern storage devices. Threads aren't woken up near the device, they're
> woken up near the queue they submitted work on.  There are definite
> performance advantages to using a queue near the device, but they may not
> be outweighed by being near to some other part of the system that the
> thread happens to be using.

As I said, all of this needs to be orchestrated by the kernel and properly
positioned for maximum performance. The queue needs to live on
cores that are on the NUMA node the device is attached to.
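
Something along these lines (an untested sketch; the PCI address is made
up, and it relies on libnuma, so link with -lnuma) would keep the
submitting thread and its memory on the device's node:

/* Bind the current thread's CPU and memory placement to the NUMA
 * node that a PCI device is attached to, so the queue and buffers
 * end up local to the device. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    int node = -1;
    FILE *f = fopen("/sys/bus/pci/devices/0000:81:00.0/numa_node", "r");

    if (!f || fscanf(f, "%d", &node) != 1 || node < 0) {
        fprintf(stderr, "no locality information for this device\n");
        return 1;
    }
    fclose(f);

    if (numa_available() < 0)
        return 1;
    numa_run_on_node(node);     /* run only on that node's CPUs */
    numa_set_preferred(node);   /* and allocate memory there */

    /* ... set up the submission queue and do the I/O from here ... */
    return 0;
}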

The problem with the queue is that it lives in kernel space, so another
interface layer to user space exists that may limit performance. The
advantage of the NVM approach is that the queue is in user space in
some form.
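
To make that concrete: an NVM Express submission queue is just a circular
buffer of fixed-size commands plus a tail doorbell, and a userspace
mapping of one looks roughly like this (a simplified sketch, not the real
64-byte command layout; the doorbell, which is an MMIO register on real
hardware, is faked here with ordinary memory):

#include <stdint.h>

#define SQ_DEPTH 64                /* must be a power of two */

struct sq_entry {                  /* stand-in for a real command */
    uint64_t lba;
    uint32_t len;
    uint32_t opcode;
};

struct sq {
    struct sq_entry ring[SQ_DEPTH];
    volatile uint32_t tail;        /* stand-in for the doorbell */
    uint32_t head;                 /* advanced from completions */
};

/* Returns 0 on success, -1 if the queue is full. */
static int sq_submit(struct sq *q, const struct sq_entry *cmd)
{
    uint32_t next = (q->tail + 1) & (SQ_DEPTH - 1);

    if (next == q->head)
        return -1;
    q->ring[q->tail] = *cmd;
    __sync_synchronize();          /* entry visible before doorbell */
    q->tail = next;                /* real hw: MMIO doorbell write */
    return 0;
}

The point is that a submission is a couple of stores and one barrier,
with no system call anywhere in the data path.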

> If your problem partitions neatly enough that you can treat everything
> separately, then use separate machines.  This is about doing I/O from all
> CPUs to all devices, and figuring out heuristics to get the best result
> without requiring the sysadmin to understand every little detail.

Right, so the kernel should come up with a default arrangement of the
kernel threads, the user threads and the processing so that we can reach
maximum locality for memory, I/O and so on. If there are separate user
space threads that do I/O to devices on different processor sockets, then
they need to have minimum interference with each other and must not limit
each other's bandwidth to the device.
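
A default arrangement could be as simple as one I/O worker per NUMA node,
each restricted to its own node (untested sketch, again using libnuma;
link with -lnuma -lpthread):

#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *io_worker(void *arg)
{
    long node = (long)arg;

    numa_run_on_node(node);     /* this node's CPUs only */
    numa_set_preferred(node);   /* allocate buffers locally */
    /* ... drive the devices attached to this node ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return EXIT_FAILURE;
    }

    long nodes = numa_max_node() + 1;
    pthread_t tid[nodes];

    for (long n = 0; n < nodes; n++)
        pthread_create(&tid[n], NULL, io_worker, (void *)n);
    for (long n = 0; n < nodes; n++)
        pthread_join(tid[n], NULL);
    return EXIT_SUCCESS;
}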

