What is RDMA, and why should we care?

DRBD9 has a new transport abstraction layer that is designed for speed; apart from SSOCKS and TCP, the next-generation link will be RDMA.

So, what is RDMA, and how is it different from TCP?

The TCP transport is a streaming protocol, which for nearly all Linux setups means that the Linux kernel takes care to deliver the messages in order and without losing any data. [1. While specialized hardware is available and helps a bit by calculating the TCP checksum, we have seen that it can cause more problems than it solves.]  To send these messages, the TCP transport has to copy the supplied data into some buffers, which takes a bit of time. Yes, zero-copy send solutions exist, but on the receiving side the fragments have to be accumulated, sorted, and merged into buffers so that the storage (hard disks or SSDs) can do its DMA from contiguous 4KiB pages.
These internal copy operations into and out of buffers are one of the major bottlenecks for network I/O. You can start to see the performance degradation in the 10GBit/sec range, and it continues to severely limit performance from there on up.  All this copying also adds latency, which affects that all-important IOPS number.  We talk about this in our user guide: Latency vs. IOPS.
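
To make the latency/IOPS connection concrete, here is a minimal Python sketch of the usual back-of-the-envelope relation: with a fixed number of requests in flight, the achievable IOPS is bounded by the queue depth divided by the per-request latency. The function name and the latency figures (100 and 120 microseconds) are hypothetical values picked for illustration; they are not DRBD measurements.

```python
def iops_upper_bound(latency_us, queue_depth=1):
    """Upper bound on IOPS when `queue_depth` requests are in flight
    and each request takes `latency_us` microseconds to complete."""
    return queue_depth * 1_000_000 / latency_us


# Hypothetical numbers: 100 us per request, then 20 us of extra copy overhead.
base = iops_upper_bound(100)
slower = iops_upper_bound(120)

print(base)            # 10000.0
print(round(slower))   # 8333
```

In this made-up case, 20 microseconds of extra latency per request costs roughly a sixth of the IOPS at queue depth 1, which is why the copy overhead matters so much for small synchronous writes.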

In contrast to that, RDMA gives network hardware the ability to directly move data from RAM in one machine to RAM in another, without involving the CPU (apart from specifying what should be transferred). It comes in various forms and implementations (InfiniBand, iWARP, RoCE) and with different on-wire protocols (some use IP, can therefore be routed, and so could be seen as “just” an advanced offload engine).

The common and important point is that the sender and receiver do not have to bother with splitting the data up (into MTU-sized chunks) or joining it back together (to get a single, aligned, 4KiB page that can be transmitted to storage, for example). They just specify “here are 16 pages of 4KiB, please store data coming from this channel into these next time” and “please push those 32KiB across this channel”. This means real zero-copy send and receive, and much lower latency.
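
To get a feeling for the work that RDMA takes off the CPU, here is a small Python simulation of the split-and-rejoin path for a 32KiB payload. It is only a userspace sketch, not what the kernel's TCP stack actually does: the helper names are made up, and the segment size of 1460 bytes is an assumption (a 1500-byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers, no options).

```python
import math

PAGE_SIZE = 4096   # 4KiB page, as in the text
MSS = 1460         # assumption: 1500-byte MTU minus 40 bytes of IPv4 + TCP headers


def split_into_segments(payload, mss=MSS):
    """Sender side: cut the payload into MSS-sized chunks (each slice is a copy)."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


def reassemble_into_pages(segments, page_size=PAGE_SIZE):
    """Receiver side: copy the in-order segments into page-sized buffers."""
    total = sum(len(seg) for seg in segments)
    pages = [bytearray(page_size) for _ in range(math.ceil(total / page_size))]
    offset = 0  # running byte offset into the reassembled stream
    for seg in segments:
        pos = 0
        while pos < len(seg):
            page, page_off = divmod(offset, page_size)
            # Copy as much as fits into the current page, then move on.
            n = min(len(seg) - pos, page_size - page_off)
            pages[page][page_off:page_off + n] = seg[pos:pos + n]
            pos += n
            offset += n
    return pages


payload = bytes(range(256)) * 128   # 32KiB of test data
segments = split_into_segments(payload)
pages = reassemble_into_pages(segments)

print(len(segments))   # 23
print(len(pages))      # 8
```

Under these assumptions, 32KiB turns into 23 segments on the wire, and every byte is copied once on the way out and once more on the way into the eight pages. With RDMA, the receiver hands the eight pages to the hardware up front and both copy loops disappear.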

Another interesting fact is that some hardware allows splitting the physical device into multiple virtual ones; this feature is called SR-IOV, and it means that a VM can push memory pages directly to another machine, without involving the hypervisor OS or copying data around. Needless to say, this should improve performance quite a bit compared to cutting data into pieces and moving them through the hypervisor… 😉

Since we started on the transport layer abstraction in 7d7a29ae8, quite some effort has been spent in that area. We’re currently running a few benchmarks and are about to publish performance results in the coming weeks, so stay tuned!

Spoiler alert: we’re going to use RAM-disks as “storage”, because we don’t have any fast-enough storage media available…
