[Ksummit-2013-discuss] NUMA locality for storage

Christoph Lameter cl at gentwo.org
Tue Jul 30 19:11:17 UTC 2013


On Tue, 30 Jul 2013, Ben Hutchings wrote:

> > Note also that recent storage controllers warm up the L3 cache of the
> > socket that connects to their PCI bus when transferring data into memory.
> > Access from local cores will be much faster than remote.
>
> It sounds like you're talking about DDIO, but that is a feature of
> recent Intel CPUs that doesn't require anything special from the device.

Right. DDIO only works on the processor socket that the device is attached
to. If you issue I/O from a remote socket you incur a significant
performance penalty.

In order to benefit from DDIO you need to be on the correct processor socket.
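For reference, the kernel already exports a PCI device's locality through
sysfs (`numa_node` and `local_cpulist` under `/sys/bus/pci/devices/<bdf>/`).
A minimal sketch of reading it from user space, assuming a hypothetical
device address; the small parser handles the `0-5,12-17` range syntax that
sysfs cpulist files use:

```python
import os

def parse_cpulist(s):
    """Parse a sysfs cpulist string like '0-5,12-17' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def device_local_cpus(bdf):
    """Return (numa_node, local CPU set) for a PCI device.

    bdf is the device address, e.g. '0000:03:00.0' -- a placeholder here;
    substitute your storage controller's address.
    """
    base = "/sys/bus/pci/devices/%s" % bdf
    with open(os.path.join(base, "numa_node")) as f:
        node = int(f.read())
    with open(os.path.join(base, "local_cpulist")) as f:
        cpus = parse_cpulist(f.read())
    return node, cpus
```

Note that `numa_node` reads as -1 when the platform does not report
locality for the device.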

> > Performance will degrade significantly in a multi socket system if the
> > processing occurs on another socket.
> >
> > This effect is so significant that I would recommend that the
> > default behavior should be a detection of the NUMA locality of the PCI
> > device and the associated cores. I/O processing for that device then needs
> > to be restricted to those cores.
>
> PCIe devices may use TLP Processing Hints (TPH) to indicate that DMA
> writes will be consumed by a particular CPU; this should cause them to
> be delivered to the L3 cache on the appropriate node.  I'm not sure how
> much you can expect this to narrow the I/O performance gap, in general.

Well, yes, a TPH hint could steer the caching to the remote processor. I
am not aware of any use of TPH today, though. It's an arcane feature and
even Intel is hesitant to use it.

But even a TPH hint won't get you all the performance of local I/O,
because there will always be an additional QPI hop involved. The best
solution is simply to do I/O as locally as possible.
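Keeping the submitting thread on the device-local socket can be sketched
from user space with `sched_setaffinity(2)` -- here via Python's
`os.sched_setaffinity`, assuming `local_cpus` was obtained from the
device's sysfs `local_cpulist` file (not shown):

```python
import os

def pin_to_local_cpus(local_cpus):
    """Restrict the calling thread to the CPUs local to the device's socket.

    local_cpus is a set of CPU ids, e.g. parsed from the device's
    sysfs local_cpulist attribute (an assumption; any CPU set works).
    """
    # Only pin to CPUs we are actually allowed to run on; an empty
    # intersection leaves the affinity mask unchanged.
    allowed = os.sched_getaffinity(0) & set(local_cpus)
    if allowed:
        os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)
```

This is only the user-space half; interrupt routing for the device
(its irq `smp_affinity`) has to be confined to the same CPUs for the
I/O path to stay fully local.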



