RDMA Performance

Posted on April 6, 2016

As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy!

Basic Hardware Information

Two IBM 8247-22L’s (Power8, 2 sockets * 10 CPUs, hyperthreading turned off)
128GiByte RAM
ConnectX4 Infiniband, two connections with 100Gbit each
The DRBD® TCP connection was run across one “bnx2x” 10Gbit adapter pair (i.e. one in each server, no bonding)
dm-zero, as we don’t have fast enough storage available. There was no IO scheduler directly on the hardware; within the VM we switched to ‘noop’.
NOTE: if you’d like to see performance data with actual, persistent storage being used, check out our newest Tech Guide – DRBD9 on an Ultrastar SN150 NVMe SSD.

The Software We Used

Ubuntu Xenial (not officially released at the time of testing)
Linux Kernel version 4.4.0-15-generic (ppc64el)
DRBD 9.0.1-1 (ded61af75823)
DRBD RDMA transport 2.0.0
fio version 2.2.10-1

Our underlying block devices have some ‘persistent’ (ha!) space at the beginning and the end, to keep the DRBD and filesystem superblocks; the rest in the middle was mapped to dm-zero:

zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
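
To double-check a table like this one, a small script can confirm that the segments are contiguous and show how much space each target covers. This is only a sketch: it assumes the usual 512-byte device-mapper sectors, and the helper names are made up for illustration.

```python
# Sketch: sanity-check the device-mapper table shown above.
SECTOR_BYTES = 512  # assumption: device-mapper tables count 512-byte sectors

TABLE = """\
zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
"""


def parse_table(text):
    """Return a list of (start, length, target, args) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split off the device name at the first colon only, so that
        # "1:1" (major:minor) in the target arguments stays intact.
        _name, rest = line.split(":", 1)
        fields = rest.split()
        start, length, target = int(fields[0]), int(fields[1]), fields[2]
        segments.append((start, length, target, fields[3:]))
    return segments


def check_contiguous(segments):
    """True if every segment starts exactly where the previous one ended."""
    expected = 0
    for start, length, _target, _args in segments:
        if start != expected:
            return False
        expected = start + length
    return True


def bytes_by_target(segments):
    """Sum the mapped bytes per target type (linear, zero, ...)."""
    totals = {}
    for _start, length, target, _args in segments:
        totals[target] = totals.get(target, 0) + length * SECTOR_BYTES
    return totals


segments = parse_table(TABLE)
totals = bytes_by_target(segments)
print("contiguous:", check_contiguous(segments))
print("linear MiB:", totals["linear"] / 2**20)
print("zero GiB:", totals["zero"] / 2**30)
```

Under the 512-byte assumption, the two linear segments hold 4MiB each for the superblocks, and the dm-zero part in the middle covers 1TiB.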

Due to many variables and potential test cases, we restricted fio’s run time to 10 seconds per test (that should be good enough for statistical purposes, see below).
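
To give an idea of how such a test matrix adds up, here is a sketch that only builds fio command lines and does not run them. The device path, the lists of block sizes, io-depths and thread counts, and the choice of `--ioengine=libaio` with `--direct=1` are assumptions for illustration, not the exact job definitions we used.

```python
import itertools
import shlex

RUNTIME_S = 10                      # from the text: 10 seconds per run
DEVICE = "/dev/drbd0"               # hypothetical device path
BLOCK_SIZES = ["4k", "16k", "1M"]   # assumed sample values
IO_DEPTHS = [1, 2, 8]               # assumed sample values
THREADS = [1, 2]                    # assumed sample values


def fio_command(rw, bs, iodepth, numjobs):
    """Build one fio invocation as an argument list (nothing is executed)."""
    return [
        "fio",
        f"--name={rw}-bs{bs}-qd{iodepth}-t{numjobs}",
        f"--filename={DEVICE}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        "--ioengine=libaio",        # assumption: asynchronous IO engine
        "--direct=1",               # assumption: bypass the page cache
        f"--runtime={RUNTIME_S}",
        "--time_based",             # run for the full runtime
        "--group_reporting",        # one summary for all threads
    ]


commands = [
    fio_command("write", bs, qd, threads)
    for bs, qd, threads in itertools.product(BLOCK_SIZES, IO_DEPTHS, THREADS)
]

for cmd in commands:
    print(shlex.join(cmd))
print(f"{len(commands)} runs, about {len(commands) * RUNTIME_S} s in total")
```

Even this small matrix needs 18 runs for writes alone, which is why a short run time per test keeps the whole series manageable. Be careful when actually running such commands: writing to a block device destroys the data on it.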

The graphics only show data for a single thread (but for multiple IO depths), for ease of reading.

As a performance baseline, here is fio running directly on the hardware (i.e. without a virtualization layer in between).

DRBD9, connected via RDMA over 100Gbit IB, writing to dm-zero

This graphic shows a few points worth highlighting:

For small block sizes (4kiB & 16kiB), the single-threaded/io-depth=1 performance is about 10k IOPs (that, times 10 seconds, amounts to 100,000 measurements, an excellent statistical base); with io-depth=2 it’s 20k IOPs, and when io-depth is 8 or higher, we reach the top performance of ~48k IOPs.
For large block sizes, the best bandwidth result is a tad above 11GiB/sec (sic!)
Last but not least, the best latency was below 40µsec! For two threads, io-depth=2, 4KiB block size, we had this result:

  lat (usec): min=39, max=4038, avg=97.44, stdev=36.22
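
The relation between these numbers can be sketched with Little's law (throughput = outstanding requests / average latency). The law is not part of our measurements; it is only used here as a plausibility check, under the assumption that fio keeps the queue full for the whole run. The function names are made up for this example.

```python
import re

# The latency summary line quoted above.
LAT_LINE = "  lat (usec): min=39, max=4038, avg=97.44, stdev=36.22"


def parse_lat(line):
    """Extract the key=value numbers from a fio latency summary line."""
    return {key: float(value) for key, value in re.findall(r"(\w+)=([\d.]+)", line)}


def iops_from_latency(outstanding_ios, avg_latency_us):
    """Little's law: throughput = outstanding requests / average latency."""
    return outstanding_ios / (avg_latency_us * 1e-6)


def latency_us_from_iops(outstanding_ios, iops):
    """Little's law, solved for the average latency in microseconds."""
    return outstanding_ios / iops * 1e6


lat = parse_lat(LAT_LINE)

# Two threads with io-depth=2 each keep up to 4 requests in flight.
estimated_iops = iops_from_latency(2 * 2, lat["avg"])
print(f"estimated IOPs for 2 threads, io-depth=2: {estimated_iops:.0f}")

# One thread, io-depth=1, about 10k IOPs: implied average latency.
print(f"implied latency at 10k IOPs: {latency_us_from_iops(1, 10_000):.1f} usec")

# Number of measurements in a 10 second run at 10k IOPs.
samples = 10_000 * 10
print(f"samples per run: {samples}")
```

The roughly 100µs implied by 10k IOPs at io-depth=1 is in the same range as the average latency in the fio output above.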

Using TCP Instead Of RDMA

As a slight comparison, here’s the same setup, but using TCP instead of RDMA; we kept the same scale to make comparison easier.

DRBD9, connected via TCP on 10Gbit, writing to dm-zero

As you can see, copying data around isn’t efficient: TCP is slower, topping out at 1.1GiB/sec on this hardware (keep in mind that the TCP connection only has a 10Gbit link, so this is close to its line rate). But I have to admit, apart from tcp_rmem and tcp_wmem, I didn’t do any tuning here either.
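
To put both results in relation to the links they ran on, the nominal link rates can be converted to GiB/sec. This is a back-of-the-envelope sketch that ignores encoding and protocol overhead; the measured values are the rounded figures quoted in this post.

```python
GIB = 2**30


def line_rate_gib_per_s(gbit_per_s):
    """Raw link rate in GiB/s, ignoring encoding and protocol overhead."""
    return gbit_per_s * 1e9 / 8 / GIB


# (label, nominal link rate in Gbit/s, rounded measured bandwidth in GiB/s)
RESULTS = [
    ("RDMA over 100Gbit IB", 100, 11.0),
    ("TCP over 10Gbit", 10, 1.1),
]

for name, gbit, measured in RESULTS:
    raw = line_rate_gib_per_s(gbit)
    print(f"{name}: raw {raw:.2f} GiB/s, measured ~{measured} GiB/s "
          f"({measured / raw:.0%} of raw)")
```

Both transports end up close to the raw rate of their respective link, so the difference in absolute bandwidth here mostly reflects the 100Gbit versus 10Gbit hardware.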

Now, we move on to results from within a VM; let’s start with reading.

The VM sees the DRBD device as /dev/sdb; we set the scheduler to ‘noop’ to not interfere with the read IOs.
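
Switching the scheduler is done through sysfs. The following sketch shows how the `queue/scheduler` file could be read and written from a script; the sample file contents are hypothetical, and on newer multi-queue kernels the equivalent of ‘noop’ is called ‘none’.

```python
from pathlib import Path


def scheduler_path(device):
    """sysfs file that lists and selects the IO scheduler of a block device."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def available_schedulers(contents):
    """All scheduler names offered by the kernel for this device."""
    return [word.strip("[]") for word in contents.split()]


def active_scheduler(contents):
    """The kernel marks the active scheduler with square brackets."""
    for word in contents.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def set_scheduler(device, name):
    """Select an IO scheduler for the device (needs root privileges)."""
    path = scheduler_path(device)
    if name not in available_schedulers(path.read_text()):
        raise ValueError(f"scheduler {name!r} not available for {device}")
    path.write_text(name)


sample = "noop [deadline] cfq"   # hypothetical contents of the sysfs file
print(scheduler_path("sdb"))
print("available:", available_schedulers(sample))
print("active:", active_scheduler(sample))
```

Within the VM, calling `set_scheduler("sdb", "noop")` as root would then correspond to the setting described above.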

Reading In A VM, DRBD Handled In Hypervisor

Here we get pretty positive results, too:

3.2GiB/sec, within the VM, should be “good enough” for most purposes, right?
~20k IOPs for some io-depths, and 3.5k IOPs with sequential IO, which is still better than hard disks on hardware.

Our next milestone is writing. Write requests have additional constraints (compared to reading) – every single write request done in the VM has to be replicated (and confirmed) in DRBD in the Hypervisor before the okay is relayed to the VM’s application.

Writing From VM, DRBD In Hypervisor

The most visible difference is the bandwidth – it tops out at ~1.1GiB/sec.

Now, we measured these bandwidths in a hyper-converged setup. The host running the VM has a copy of the data available locally. As that might not always be the case, I detached this LV and tested again.

So, if the hypervisor does not have local storage (but always has to ask some other node), we get these pictures:

Reading within a VM, remote storage only

Writing from a VM, remote storage only

As we can see, the results are mostly the same – apart from a bit of noise, the limiting factor here is the virtualization bottleneck, not the storage transport.

The only thing left now is to summarize our findings.

Our Findings

We lack the storage speed in our test setup (if you’d like to see performance data with actual, persistent storage being used, check out our newest Tech Guide, “DRBD9 on an Ultrastar SN150 NVMe SSD”):
Even now, without multi-queue capable DRBD, we can already utilize the total 100Gbit Infiniband RDMA bandwidth. Every performance optimization will only move the parallelism and block sizes needed to reach line speed to more typical values.
VM performance is probably acceptable already
If you need performance above the available range (3.2GiB/sec reading, 1.1GiB/sec writing), you’ll want to put your workload on hardware anyway.
It might still get faster by using DRBD within the VM, thus removing the virtualization delay.
As the 4.4 kernel we used does not yet support SR-IOV for the ConnectX-4 cards, we couldn’t test that yet (support for SR-IOV should be in the 4.5 series, though). In theory, this should give approximately the same speed in the VM as on hardware, as the OS running in the VM should be able to read and write data directly to/from the remote storage nodes.

Conclusion

We may have to follow up soon. In the meantime, the Tech Guide for RDMA performance with non-volatile storage is available online. Head to the LINBIT® Tech Guide area and read the HGST Ultrastar SN150 NVMe performance report! (Free registration required.)

If you have any questions about RDMA Performance or anything else, don’t hesitate to get in touch.

Yusuf Yıldız

After nearly 15 years of system and storage management, Yusuf started to work as a solution architect at LINBIT. Yusuf's main focus is on customer success and contributing to product development and testing. As part of the solution architects team, he is one of the backbones and supporters of the sales team.
