Minimize Downtime During Maintenance

System maintenance, whether planned or in response to failure, is a necessary part of managing infrastructure. Everyone hopes for the former rather than the latter. We do our system maintenance quarterly here at LINBIT in hopes that the latter is avoided. These maintenance windows are where we install hardware and software updates, test failovers, and give everything a once-over to ensure configurations still make sense.

Normally, in the case of planned maintenance, users are left waiting for access while IT does whatever they need to do. This leads to a bad user experience. In fact, that is precisely what led to this blog post. I was looking for a BIOS update for a motherboard in the server room and was presented with this lovely message:

We are sorry!

I just had a bad user experience. And to further the experience, I have no indication as to when it will be back up or available. I guess I’m supposed to keep checking back until I get what I was looking for… if I remember to.

Here at LINBIT we use DRBD for all of our systems. This keeps them on and available for our end users and customers. If you landed on this site and aren’t familiar with DRBD, it is an open source project developed by us, LINBIT. In its simplest form, you can think of it as network RAID 1: instead of mirroring between independent disks, you mirror between two (or more, if you’re using DRBD9) independent systems. You essentially now need to lose twice the hardware to experience downtime of services.
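
To put a rough number on "twice the hardware", here is a back-of-the-envelope sketch in Python. It assumes node failures are independent of each other, and the 99% per-node availability is a made-up figure for illustration, not a measurement of any real system. The function name is hypothetical too.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Chance that at least one of `replicas` nodes is up.

    Assumes every node has the same availability and that nodes
    fail independently, which real hardware only approximates.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Services are down only if every replica is down at the same time.
    all_down = (1.0 - node_availability) ** replicas
    return 1.0 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "node(s):", round(replicated_availability(0.99, n), 6))
```

Under those assumptions, each additional replica multiplies the chance of a full outage by the per-node failure probability, which is why a second system helps so much.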

One commonly overlooked benefit of using DRBD is that system maintenance and upgrades can be done with minimal to no interruption of services. The length of the interruption is generally tied to the type of deployment. For example, if you’re using virtual machines, live migration can be achieved using DRBD, resulting in no downtime. If you’re running services directly on hardware and they need to be stopped and restarted, your downtime will be limited to the failover time.

So how do we do this? Let’s say you have two servers, Frodo and Sam. Frodo is Primary (running services) and Sam is Secondary. In this example we need to update the BIOS and upgrade the RAM of our servers. Follow these steps:

  1. First put the cluster into maintenance mode
  2. Next power off Sam (the secondary server)
    1. We can now install any upgrades or hardware we need to
    2. Power the system up, enter the BIOS and make sure everything is OK
    3. Reboot and update the BIOS
  3. Boot Sam into the OS
    1. At this point you can install any OS updates and reboot again if needed
  4. Once Sam is back up and everything is verified to be in good condition, bring the cluster out of maintenance mode
  5. Now migrate services to Sam. Again, depending on how things are configured, this may or may not cause a few seconds of service unavailability
  6. Repeat steps 1-4 for Frodo
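
The ordering of those steps is the whole trick, so here is a minimal sketch of it as a toy Python model. It does not talk to DRBD or to any cluster manager: the `Cluster` class, its methods, and the log messages are all hypothetical names made up for illustration. All it shows is that the node running services is never the one being worked on.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy two-node cluster that tracks roles and upgrade status only."""
    primary: str
    secondary: str
    maintenance_mode: bool = False
    upgraded: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def service_secondary(self) -> None:
        """Steps 1-4: work on the node that is NOT running services."""
        node = self.secondary
        self.maintenance_mode = True
        self.log.append("enter maintenance mode")
        self.log.append(f"power off {node}, install hardware, update BIOS")
        self.log.append(f"boot {node}, install OS updates, verify")
        self.upgraded.add(node)
        self.maintenance_mode = False
        self.log.append("leave maintenance mode")

    def migrate_services(self) -> None:
        """Step 5: swap roles so the upgraded node runs services."""
        if self.maintenance_mode:
            raise RuntimeError("leave maintenance mode before migrating")
        if self.secondary not in self.upgraded:
            raise RuntimeError("only migrate to a node that was verified")
        self.primary, self.secondary = self.secondary, self.primary
        self.log.append(f"migrate services to {self.primary}")


def rolling_maintenance(cluster: Cluster) -> Cluster:
    cluster.service_secondary()  # steps 1-4 on the current Secondary
    cluster.migrate_services()   # step 5
    cluster.service_secondary()  # step 6: steps 1-4 on the old Primary
    return cluster


if __name__ == "__main__":
    done = rolling_maintenance(Cluster(primary="Frodo", secondary="Sam"))
    for entry in done.log:
        print(entry)
```

In this model Sam ends up as Primary once both nodes are upgraded. Whether you migrate services back to Frodo afterwards is up to you.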

There you have it: one of the better-kept secrets of using DRBD.

The advantage of separate control and data planes

Many storage systems have a monolithic design that combines the control plane and the data plane into a single application and a single protocol, but LINBIT’s more modular solution comes with a number of advantages.

What is a control plane or a data plane?

The most important task that any storage system must perform is providing access to the storage volumes that are used for various workloads, for example, databases, file servers or virtualization environments. This is what we refer to as the data plane – all the components that are necessary to actually get data from the storage to the user and from the user to the storage.

Another task is the management of the configuration of storage volumes, which is what we refer to as the control plane. With the rise of more dynamic systems like containerization, virtualization and cloud environments, and the associated software-defined storage systems, where storage volumes are frequently reconfigured, this task is becoming increasingly important.

Why it is important: Availability

If you need to shut down part of your infrastructure, for instance because you are updating hardware, it is important that the most fundamental services remain available. Storage is probably one of those fundamental services, since most of the other systems rely on it.

A storage system with a modular design that provides independent control and data planes brings your infrastructure one step closer to high availability.

Independent control and data plane

Many storage systems can only provide access to storage volumes if all of their subsystems are online. The design may even be completely monolithic, so that the management functions and the storage access functions are contained within a single application that uses a single network protocol.

In LINBIT’s DRBD-based storage systems, only the most fundamental control plane functions are tightly coupled with the data plane and the operation of storage volumes. High-level control functions, like managing storage volumes and their configuration, managing cluster nodes, or providing automatic selection of cluster nodes for the creation of storage volumes, are provided by the LINSTOR storage management software. These two components, DRBD and LINSTOR, are fundamentally independent of each other.

DRBD storage volumes, even those that are managed by LINSTOR, remain accessible even if the LINSTOR software is unavailable. This means that the LINSTOR software can be shut down, restarted or upgraded while users retain their access to existing storage volumes. While it is less useful, the same is true the other way around: a LINSTOR controller that does not rely on storage provided by DRBD will continue to service storage management requests even if the storage system itself is unavailable. The changed configuration will simply be applied whenever the actual storage system is online again.
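
To make that independence concrete, here is a minimal sketch in Python. It is a toy model, not the LINSTOR or DRBD API: the `DataPlane` and `Controller` classes and all of their methods are hypothetical names invented for this example. It shows the two directions described above: existing volumes stay readable while the controller is down, and configuration changes accepted while storage is offline are applied once it returns.

```python
class DataPlane:
    """Stands in for the storage layer: serves volumes on its own."""

    def __init__(self):
        self.online = True
        self.volumes = {}  # volume name -> size in GiB

    def read(self, name):
        # Note: no reference to the controller anywhere in this class.
        if not self.online:
            raise RuntimeError("storage offline")
        return self.volumes[name]


class Controller:
    """Stands in for the management layer: records desired configuration."""

    def __init__(self, data_plane):
        self.data_plane = data_plane
        self.online = True
        self.pending = []  # changes accepted but not yet applied

    def create_volume(self, name, size_gib):
        if not self.online:
            raise RuntimeError("controller offline")
        self.pending.append((name, size_gib))
        self.apply_pending()

    def apply_pending(self):
        """Push queued changes to storage; returns how many were applied."""
        if not self.data_plane.online:
            return 0  # keep the queue until storage is back
        applied = 0
        while self.pending:
            name, size_gib = self.pending.pop(0)
            self.data_plane.volumes[name] = size_gib
            applied += 1
        return applied


if __name__ == "__main__":
    storage = DataPlane()
    controller = Controller(storage)
    controller.create_volume("db", 10)

    controller.online = False          # e.g. controller being upgraded
    print(storage.read("db"))          # existing volume is still served

    controller.online = True
    storage.online = False             # storage system unavailable
    controller.create_volume("web", 5) # request is accepted and queued
    storage.online = True
    print(controller.apply_pending())  # queued change is applied now
```

The design choice to notice is that `DataPlane.read` never consults the controller, so nothing on the data path can be blocked by a management outage.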

Robert Altnoeder
Robert joined the LINBIT development team in 2013. He had worked with
DRBD at a startup company in the SaaS field before joining LINBIT. His
current primary field of work is the architecture and implementation of
LINSTOR, the cluster management component of LINBIT's SDS software.