As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy!
Basic hardware information:
- Two IBM 8247-22L’s (Power8, 2 sockets * 10 CPUs, hyperthreading turned off)
- 128GiByte RAM
- ConnectX4 Infiniband, two connections with 100Gbit each
- The DRBD TCP connection was run across one “bnx2x” 10Gbit adapter pair (i.e. one in each server, no bonding)
- dm-zero, as we don’t have fast enough storage available; directly on hardware there was no IO scheduler, within the VM we switched to “noop”.
NOTE: if you’d like to see performance data with real, persistent storage being used, check out our newest Tech Guide – “DRBD9 on an Ultrastar SN150 NVMe SSD“.
Software we used:
- Ubuntu Xenial (not officially released at the time of testing)
- Linux Kernel version 4.4.0-15-generic (ppc64el)
- DRBD 9.0.1-1 (ded61af75823)
- DRBD RDMA transport 2.0.0
Our underlying block devices were built to have some “persistent” (ha!) space at the beginning and the end, to keep the DRBD and filesystem superblocks; the rest in the middle was mapped to dm-zero:
zero-block-1: 0 8192 linear 1:1 0
zero-block-1: 8192 2147483648 zero
zero-block-1: 2147491840 8192 linear 1:1 8192
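To make the sizes in that table easier to read, here is a minimal Python sketch that parses the three segments and converts them to bytes. It assumes the usual device-mapper convention of 512-byte sectors; the helper names (`parse_table`, `check_contiguous`) are made up for this illustration and are not part of any tool.

```python
SECTOR_BYTES = 512  # assumption: device-mapper tables count 512-byte sectors

# The three segments from the table above: start, length, target, arguments.
TABLE = """\
0 8192 linear 1:1 0
8192 2147483648 zero
2147491840 8192 linear 1:1 8192
"""


def parse_table(text):
    """Return a list of (start_sector, length_sectors, target) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        rows.append((int(fields[0]), int(fields[1]), fields[2]))
    return rows


def check_contiguous(rows):
    """True if every segment starts exactly where the previous one ended."""
    expected = 0
    for start, length, _target in rows:
        if start != expected:
            return False
        expected = start + length
    return True


rows = parse_table(TABLE)
for start, length, target in rows:
    mib = length * SECTOR_BYTES / 2**20
    print(f"{target:>6} at sector {start}: {mib:.0f} MiB")
print("contiguous:", check_contiguous(rows))
```

Under that assumption the two linear segments are 4 MiB each (enough for the superblocks), and the zero-mapped middle is 1 TiB.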
Due to the large number of variables and potential test cases, we restricted fio’s run-time to 10 seconds (as that should be good enough for statistical purposes, see below).
The graphics only show data for a single thread (but for multiple IO-depths), for ease of reading.
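To give a feeling for why the run-time per test matters, here is a rough Python sketch that builds a test matrix and adds up the total run-time. Only the 10 second run-time comes from this post; the device path, the block sizes, io-depths, thread counts, the `randwrite` pattern and the `libaio` engine are illustrative assumptions, not our actual job definitions.

```python
import itertools
import shlex

# Everything except RUNTIME_S is an assumption for illustration only.
DEVICE = "/dev/drbd0"            # hypothetical DRBD device path
BLOCK_SIZES = ["4k", "16k", "64k", "1M"]
IO_DEPTHS = [1, 2, 8, 32]
THREADS = [1, 2]
RUNTIME_S = 10                   # from the text: 10 seconds per run


def fio_command(bs, iodepth, numjobs, rw="randwrite"):
    """Build one fio command line for a single point of the matrix."""
    args = [
        "fio",
        "--name=drbd-test",
        f"--filename={DEVICE}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        "--ioengine=libaio",
        "--direct=1",
        "--time_based",
        f"--runtime={RUNTIME_S}",
        "--group_reporting",
    ]
    return " ".join(shlex.quote(a) for a in args)


matrix = list(itertools.product(BLOCK_SIZES, IO_DEPTHS, THREADS))
commands = [fio_command(bs, depth, jobs) for bs, depth, jobs in matrix]
total_seconds = len(commands) * RUNTIME_S

print(commands[0])
print(f"{len(commands)} runs, {total_seconds} s of pure measurement time")
```

Even this small, made-up matrix needs more than five minutes per transport and per setup, which is why short runs are attractive.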
[nextpage title=”DRBD on Hardware”]
As a performance reference point, here is fio running directly on the hardware (i.e. without a virtualization layer in between).
This graphic shows a few points that should be highlighted:
- For small block sizes (4KiB & 16KiB), the single-threaded/io-depth=1 performance is about 10k IOPs [That, times 10 seconds, amounts to 100,000 measurements – a nice statistical base, I believe ;)]; with io-depth=2 it’s 20k IOPs, and when io-depth is 8 or higher, we reach top performance of ~48k IOPs.
- For large block sizes, the best bandwidth result is a tad above 11GiB/sec (sic!)
- Last, but not least, the best latency was below 40µsec! For two threads, io-depth=2, 4KiB block size we had this result:
lat (usec): min=39, max=4038, avg=97.44, stdev=36.22
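The IOPs and latency figures in the list above are tied together by Little's law (throughput = requests in flight / average latency). The following Python sketch is just a sanity check on the published numbers, not part of the measurement; it assumes the queue is always full and, for the standard error, that the latency samples are independent.

```python
import math


def iops_from_latency(in_flight, avg_latency_us):
    """Little's law: throughput = requests in flight / average latency."""
    return in_flight / (avg_latency_us * 1e-6)


def latency_from_iops(in_flight, iops):
    """Little's law, solved for the average latency in microseconds."""
    return in_flight / iops * 1e6


# One thread, io-depth=1, about 10k IOPs: implied average latency.
single_lat_us = latency_from_iops(1, 10_000)

# Two threads with io-depth=2 each, i.e. 4 requests in flight, avg=97.44 usec.
est_iops = iops_from_latency(2 * 2, 97.44)

# Samples in a 10 second run at that rate, and the standard error of the
# mean for stdev=36.22 usec (assumes independent samples).
samples = est_iops * 10
sem_us = 36.22 / math.sqrt(samples)

print(f"io-depth=1: about {single_lat_us:.0f} usec per request")
print(f"4 in flight at 97.44 usec: about {est_iops:.0f} IOPs")
print(f"standard error of the mean: about {sem_us:.3f} usec")
```

So the roughly 100 µsec average latency is consistent with 10k IOPs at io-depth=1, and with that many samples the average is pinned down to well below a microsecond.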
As a small aside, here’s the same setup, but using TCP instead of RDMA; we kept the same scale to make comparison easier.
As you can see, copying data around isn’t that efficient – TCP is clearly slower, topping out at 1.1GiB/sec on this hardware. (But I have to admit, apart from tcp_rmem and tcp_wmem I didn’t do any tuning here either).
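For comparison with the nominal link speeds, here is a small Python sketch that converts Gbit/s into GiB/s. It assumes the raw signalling rate of a single link and ignores any encoding or protocol overhead; the measured values are the ones quoted in this post (with "a tad above 11GiB/sec" rounded down to 11.0).

```python
def gbit_to_gib_per_s(gbit_per_s):
    """Convert a link speed in Gbit/s (10^9 bit/s) to GiB/s (2^30 byte/s)."""
    return gbit_per_s * 1e9 / 8 / 2**30


# (nominal Gbit/s, measured GiB/s as quoted in the text)
links = {
    "RDMA over 100Gbit Infiniband": (100, 11.0),
    "TCP over 10Gbit Ethernet": (10, 1.1),
}

for name, (gbit, measured) in links.items():
    ceiling = gbit_to_gib_per_s(gbit)
    print(f"{name}: ceiling {ceiling:.2f} GiB/s, "
          f"measured {measured} GiB/s ({measured / ceiling:.0%})")
```

By this estimate both transports get close to their respective line rate; the TCP result is limited by the 10Gbit adapter, so the two graphs compare the setups as tested rather than the protocols on equal links.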
Now, we move on to results from within a VM; let’s start with reading.
[nextpage title=”Reading within a VM”]
The VM sees the DRBD device as /dev/sdb; the scheduler was set to “noop” to not interfere with the read IOs.
Here we get quite nice results, too:
- 3.2GiB/sec, within the VM, should be “good enough” for most purposes, right?
- ~20k IOPs for some io-depths; and even 3.5k IOPs with sequential IO is better than using harddisks on hardware.
Our next milestone is writing…
[nextpage title=”Writing in the VM”]
Write requests have additional constraints (compared to reading) – every single write request done in the VM has to be replicated (and confirmed) in DRBD in the Hypervisor before the okay is relayed to the VM’s application.
The most visible difference is the bandwidth – it tops out at ~1.1GiB/sec.
Now, these bandwidths were measured in a hyperconverged setup – the host running the VM has a copy of the data available locally. As that might not always be the case, I detached this LV, and tested again.
[nextpage title=”VM with Remote Storage”]
So, if the hypervisor does not have local storage (but always has to ask some other node), we get these pictures:
As we can see, the results are mostly the same – apart from a bit of noise, the limiting factor here is the virtualization bottleneck, not the storage transport.
The only thing left now is a summary and conclusion…
[nextpage title=”Summary and conclusion”]
- We lack the storage speed in our test setup:
Even now, without multi-queue capable DRBD, we can already utilize the full 100Gbit Infiniband RDMA bandwidth; any further performance optimization will only move the parallelism and block sizes needed to reach line speed down to more common values.
- VM performance is probably acceptable already
If you need performance above the now available range (3.2GiB/sec reading, 1.1GiB/sec writing), you’ll want to put your workload on hardware anyway.
- Might get much faster still, by using DRBD within the VM but removing the virtualization delay.
As the 4.4 kernel we used does not yet support SR-IOV for the ConnectX-4 cards, we couldn’t test that yet [Support for SR-IOV should be in the 4.5 series, though…]. In theory this should give approximately the same speed in the VM as on hardware, as the OS running in the VM should be able to read and write data directly to/from the remote storage nodes…
I guess we’ll need to do another follow-up in this series later on … 😉
Questions? Contact firstname.lastname@example.org!