
IOPS World Record Broken – LINBIT Tops 14.8 million IOPS

In a performance test, LINBIT measured 14.8 million IOPS on a 12-node cluster built from standard off-the-shelf Intel servers. This is the highest storage performance on the market reached by a hyper-converged system on this class of hardware. Even a small LINBIT storage system can provide millions of IOPS at latencies of a fraction of a millisecond. For real-world applications, these figures translate into outstanding application performance.

Test setup

LINBIT chose this setup because our competitors have published test results from equivalent systems, making it easy to compare the strengths of each software offering under fair conditions in the same environment. We worked hard to get the most out of the system, and it paid off. Microsoft managed to reach 13.7 million IOPS, and StorPool marginally topped that with 13.8 million IOPS. We reached 14.8 million remote read IOPS – a significant jump of 7.2%! “Those performance numbers mark a milestone in the development of our software. The results prove we speed up High Availability at a large scale”, says CEO Philipp Reisner. The numbers would scale up even further with a larger setup.


These exciting results are for 3-way synchronous replication using DRBD. The test cluster was provided through the Intel® Data Center Builders program. It consists of 12 servers, each running 8 instances of the benchmark, making a total of 96 instances. The setup is hyper-converged, meaning that the same servers are used to run the benchmark and to provide the underlying storage.

 


For some benchmarks, one of the storage replicas is a local replica on the same node as the benchmark workload itself. This is a particularly effective configuration for DRBD.

DRBD provides a standard Linux block device, so it can be used directly from the host, from a container, or from a virtual machine. For these benchmarks, the workload runs in a container, demonstrating the suitability of LINBIT’s SDS solution, which consists of DRBD and LINSTOR, for use with Kubernetes.
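As a rough illustration, launching one such workload instance could look like the following. This is a minimal sketch, not the exact harness used for these tests; the container image name, the DRBD device path, and the job details are assumptions.

# Sketch: one fio workload instance running in a container against a DRBD device.
# "fio-image" is a hypothetical container image with fio installed; /dev/drbd1000
# is an illustrative DRBD device name.
docker run --rm --name bench-0 \
    --device /dev/drbd1000 \
    fio-image \
    fio --name=bench --filename=/dev/drbd1000 \
        --rw=randread --bs=4k --iodepth=64 --direct=1 \
        --ioengine=libaio --time_based=1 --runtime=600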

IOPS and bandwidth results are the totals from all 96 workload instances. Latency results are averaged.

Let’s look into the details!

Top performance with DRBD

5.0 million synchronously replicated write IOPS

This was achieved with a 4K random write benchmark with an IO depth of 64 for each workload. The setup uses Intel® Optane™ DC Persistent Memory to store the DRBD metadata. The writes are 3-way replicated with one local replica and two remote replicas. This means that the backing storage devices are writing at a total rate of 15 million IOPS.

85μs synchronously replicated write latency

This was achieved with a 4K random write benchmark with serial IO. That is, an IO depth of 1. This means that the writes were persisted to all 3 replicas within an average time of only 85μs. DRBD attained this level of performance both when one of the replicas was local and when all were remote. The setup also uses Intel® Optane™ DC Persistent Memory to store metadata.

14.8 million remote read IOPS

This was achieved with a 4K random read benchmark with an IO depth of 64. This corresponds to 80% of the total theoretical network bandwidth of 75GB/s. This result was reproduced without any use of persistent memory so that the value can be compared with those from our competitors.
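As a cross-check on that figure: 14.8 million reads per second at 4KiB each amounts to roughly 60GB/s of data arriving over the network, and the cluster's aggregate network capacity is 12 servers × 2 × 25GbE ≈ 75GB/s, which gives the quoted 80% utilization.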

10.6 million IOPS with 70/30 mixed read/write

Representing a more typical real-world scenario, this benchmark consisted of 70% reads and 30% writes with an IO depth of 64. One of the 3 replicas was local.


Benefits of persistent memory

DRBD is optimized for persistent memory. When the DRBD metadata is stored on an NVDIMM, write performance is improved.

When the metadata is stored on the backing storage SSD with the data, DRBD can process 4.5 million write IOPS. This increases to 5.0 million when the metadata is stored on Intel® Optane™ DC Persistent Memory instead, an improvement of 10%.

Moving the metadata onto persistent memory has a particularly pronounced effect on the write latency. This metric plummets from 113μs to 85μs with this configuration change. That is, the average write is 25% faster.

Detailed results

Below are the full results for DRBD running on the 12 servers with a total of 96 benchmark workloads.

Benchmark name                              | Without local replica | With local replica
Random read (higher is better)              | 14,800,000 IOPS       | 22,100,000 IOPS
Random read/write 70/30 (higher is better)  | 8,610,000 IOPS        | 10,600,000 IOPS
Random write (higher is better)             | 4,370,000 IOPS        | 5,000,000 IOPS
Sequential read (higher is better)          | 64,300 MB/s           | 111,000 MB/s
Sequential write (higher is better)         | 20,700 MB/s           | 23,200 MB/s
Read latency (lower is better)              | 129 μs                | 82 μs
Write latency (lower is better)             | 85 μs                 | 84 μs

The IOPS and MB/s values have been rounded down to 3 significant figures.

All volumes are 500GiB in size, giving a total active set of 48,000GiB and consuming a total of 144,000GiB of the underlying storage. The workloads are generated using the fio tool with the following parameters:

Benchmark type | Block size | IO depth | Workload instances | Total active IOs
Random         | 4K         | 64       | 96                 | 6144
Sequential     | 128K       | 16       | 96                 | 1536
Latency        | 4K         | 1        | 96                 | 96
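For example, the sequential-read row of this table corresponds roughly to the following fio invocation per workload instance; the random and latency rows differ mainly in the --rw, --bs and --iodepth values. This is a sketch; the device name is illustrative and the exact job options used for the published runs may differ.

fio --name=seqread --filename=/dev/drbd1000 \
    --rw=read --bs=128k --iodepth=16 --direct=1 \
    --ioengine=libaio --time_based=1 --runtime=600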

Quality controls

In order to ensure that the results are reliable, the following controls were applied:

  • The entire dataset was written after allocating the volumes, but before running the tests (see the sketch after this list). This prevents artificially fast reads of unallocated blocks: when the backing device driver or firmware recognizes that an unallocated block is being read, it may simply return zeros without reading from the physical medium.
  • The benchmark uses direct IO to bypass the operating system cache and the working set was too large to be cached in memory in any case.
  • The tests were each run for 10 minutes. The metrics stabilized within a small proportion of this time.
  • The measurements were provided by the benchmarking tool itself, rather than being taken from a lower level such as the DRBD statistics. This ensures that the performance corresponds to that which a real application would experience.
  • The benchmark's random access pattern used a random seed, to avoid any bias from the same blocks being chosen in successive test runs.
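The pre-fill step from the first control above can be done with fio itself. A minimal sketch, assuming an illustrative DRBD device name:

# Sketch: write the entire device once before benchmarking, so that every
# block is allocated and later random reads hit the physical medium.
fio --name=prefill --filename=/dev/drbd1000 \
    --rw=write --bs=1M --iodepth=16 --direct=1 \
    --ioengine=libaio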

Software stack

The following key software components were used for these benchmarks:

  • Distribution: CentOS 8.0.1905
  • Kernel: Linux 4.18.0-80.11.2.el8_0.x86_64
  • LVM from the distribution
  • DRBD 9.0.21-1
  • Docker 19.03.5
  • Fio 3.7

Terminology

In this text, and at LINBIT in general, we use the expression 2 replicas to indicate that the data is stored on 2 storage devices. For these tests, there are 3 replicas, meaning that the data is stored on 3 storage devices.

In other contexts, the expression 2 replicas might mean one original plus 2 replicas, in which case the data would be stored on 3 storage devices.

Test infrastructure

These results were obtained on a cluster of 12 servers made available as part of the Intel® Data Center Builders program. Each server was equipped with the following configuration:

  • Processor: 2x Intel® Xeon Platinum 8280L CPU
  • Memory: 384GiB DDR4 DRAM
  • Persistent memory: 4x 512GB Intel® Optane™ DC Persistent Memory
  • Storage: At least 4x Intel® SSD DC P4510 of at least 4TB
  • Network: Intel® Ethernet Network Adapter XXV710 with dual 25GbE ports

The servers were all connected in a simple star topology with a 25Gb switch.


Speed is of the essence

Storage has often been a bottleneck in modern IT environments, and the two requirements of speed and high availability have always been in competition. If you aim for maximum speed, the quality of the high availability tends to suffer, and vice versa. With this performance test we demonstrate the best-of-breed open source software-defined storage solution: a replicated storage system that combines high availability with the performance of local NVMe drives is now possible.

This technology enables any public and private cloud builder to deliver high performance for their applications, VMs and containers. If you aim to build a powerful private or public cloud, our solution meets your storage performance needs.

If you want to learn more or have any questions, do contact us at [email protected]

 

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

 


Optimizing DRBD for Persistent Memory

Persistent Memory (PMEM) is an exciting storage tier with much lower latency than SSDs. LINBIT has optimized DRBD for when its metadata is stored on PMEM/NVDIMM.

This article relates to both:

  • Traditional NVDIMM-N: Some DRAM is accompanied by NAND-flash. On power failure, a backup power source (supercap, battery) is used to save the contents of the DRAM to the flash storage. When the main power is restored, the contents of the DRAM are restored. These components have exactly the same timing characteristics as DRAM and are available in sizes of 8GB, 16GB or 32GB per DIMM.
  • Intel’s new Optane DC Persistent Memory: These DIMMs are built using a new technology called 3D XPoint. It is inherently non-volatile and has only slightly higher access times than pure DRAM. It comes in much higher capacities than traditional NVDIMMs: 128GB, 256GB and 512GB.

DRBD requires metadata to keep track of which blocks are in sync with its peers. This consists of 2 main parts. One part is the bitmap which keeps track of exactly which 4KiB blocks may be out of sync. It is used when a peer is disconnected to minimize the amount of data that must be synced when the nodes reconnect. The other part is the activity log. This keeps track of which data regions have ongoing I/O or had I/O activity recently. It is used after a crash to ensure that the nodes are fully in sync. It consists of 4MiB extents which, by default, cover about 5GiB of the volume.

Since the DRBD metadata is small and frequently accessed, it is a perfect candidate to be placed on PMEM. A single 8GiB NVDIMM can store enough metadata for 100 volumes of 1TiB each, allowing for replication between 3 nodes.
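As a rough sizing check: the bitmap needs about one bit per 4KiB block per peer, so a 1TiB volume replicated to 2 peers needs on the order of 2 × 32MiB of bitmap, and 100 such volumes come to roughly 6.3GiB, comfortably within an 8GiB NVDIMM.

Placing the metadata on a PMEM device uses DRBD's external metadata support. The following is a minimal sketch, not a configuration taken from these tests; the resource name, hostnames, addresses, and device paths are illustrative.

# Sketch: data on an NVMe SSD, DRBD metadata on a PMEM device.
cat > /etc/drbd.d/bench.res <<'EOF'
resource bench {
    net { protocol C; }            # fully synchronous replication
    volume 0 {
        device    /dev/drbd1000;
        disk      /dev/nvme0n1p1;  # data on NVMe
        meta-disk /dev/pmem0;      # DRBD metadata on persistent memory
    }
    on node-a { node-id 0; address 192.168.1.1:7000; }
    on node-b { node-id 1; address 192.168.1.2:7000; }
    connection-mesh { hosts node-a node-b; }
}
EOF
drbdadm create-md bench   # initializes the metadata on /dev/pmem0
drbdadm up bench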

PMEM outperforms

DRBD 9 has been optimized to access metadata on PMEM directly using memory operations. This approach is extremely efficient and leads to significant performance gains. The improvement is most dramatic when the metadata is updated most often, which happens when writes are performed serially, that is, with an I/O depth of 1. In this case, scattered I/O forces the activity log to be updated on every write. Here we compare performance with the metadata on a separate NVMe device against the metadata on PMEM, both with and without the optimizations.

As can be seen, placing the DRBD metadata on a PMEM device results in a massive performance boost for this kind of workload.

[Chart: 4K serial write performance (I/O depth 1) with DRBD metadata on NVMe vs. on PMEM, with and without the PMEM optimizations]

Impact with concurrent I/O

When I/O is submitted concurrently, DRBD does not have to access the metadata as often. Hence we do not expect the performance impact to be quite as dramatic. Nevertheless, there is still a significant performance boost, as can be seen.

If you have a workload with very high I/O depth, you may wish to trial DRBD 10, which performs especially well in such a situation. See https://www.linbit.com/en/drbd10-vs-drbd9-performance/.

[Chart: 4K random write performance at higher I/O depths with DRBD metadata on NVMe vs. on PMEM]

Technical details

The above tests were carried out on a pair of 16 core servers equipped with NVMe storage and a direct ethernet connection. Each server had an 8GiB DDR4 NVDIMM from Viking installed. DRBD 9.0.17 was used to perform the tests without the PMEM optimizations and DRBD 9.0.20 for the remainder. I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 \
    --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 \
    --runtime=60 --size=48G --filename=/dev/drbd500

 

If you have technical questions, don’t hesitate to subscribe to our email list.

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

Performance Gains with DRBD 10

A key factor in evaluating storage systems is their performance. LINBIT has been working to further improve the performance of DRBD. The recent DRBD 10 alpha release demonstrates significant gains.

The performance gains particularly help with highly concurrent workloads, an area that has been steadily rising in importance and looks set to keep rising. Improvements in single-core speed appear to be stagnating, while the number of available cores keeps growing. Hence software systems need to utilize concurrency effectively to make the most of the available computing resources.

We tested DRBD 10 with 4K random writes and various concurrency levels. In this test, the data is being replicated synchronously (“protocol C”) between two nodes. These numbers are for a single volume, not an aggregate over many volumes. I/O was generated by 8 processes. The tests show improvements in raw random write performance of up to 68%.

[Chart: 4K random write performance, DRBD 10 alpha vs. DRBD 9, at various I/O depths]

These improvements were achieved by using a finer-grained locking scheme. This allows, for instance, one core to be sending a request while a second core is submitting the next request. The result is better utilization of the available cores and overall higher throughput.

Technical details

The above tests were carried out on a pair of 16 core servers equipped with NVMe storage and a direct ethernet connection. The software versions used were DRBD 10.0.0a1 and its most recent ancestor from the DRBD 9 branch (8e93a5d93b62). I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500

Ongoing development on DRBD 10

LINBIT is working on a number of exciting major features for DRBD 10.

  • Request forwarding. DRBD will send data to geographically distant sites only once, where it will then be replicated locally.
  • PMEM journaling. DRBD can already access its metadata in a PMEM optimized fashion. That will be extended to using a PMEM device as a write-back cache, resulting in improved performance in latency-sensitive scenarios.
  • Erasure coding. DRBD will be able to erasure code and distribute its data. This provides the same functionality as RAID5/6, but with an arbitrary number of parity nodes. The result is lower disk usage with similar redundancy characteristics.

Stable releases of DRBD 10 are planned for 2020 – until then stay tuned for upcoming updates!

 

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.