System maintenance, whether planned or in response to failure, is a necessary part of managing infrastructure. Everyone hopes for the former, rather than the latter. We do our system maintenance quarterly here at LINBIT in hopes that the later is avoided. These maintenance windows are where we install hardware and software updates, test failovers, and give everything a once over to ensure configurations still make sense.
Normally in the case of planned maintenance, users are left waiting for access while IT does whatever they need to do. This leads to a bad user experience. In fact, that is precisely what lead to this blog post. I was looking for a BIOS update for a motherboard in the server room and was presented with this lovely message:
I just had a bad user experience. And to further the experience, I have no indication as to when it will be back up or available. I guess I’m supposed to keep checking back until I get what I was looking for… if I remember to.
Here at LINBIT we use DRBD for all of our systems. This ensures that they are always on and always available for the end users and our customers. If for some reason you landed on this site and aren’t familiar with DRBD, DRBD is an open source project developed by us, LINBIT. In its simplest form you can think of it as network raid 1, however instead of having independent disks, you have two (or more if you’re using DRBD9) independent systems. You essentially now need to lose twice the hardware to experience downtime of services.
One commonly ignored or unrealized benefit of using DRBD is that system maintenance and upgrades can be done with minimal to no interruption of services. The length of the interruption is generally tied to the type of deployment – for example if you’re using virtual machines, live migration can be achieved using DRBD resulting in no downtime. If you’re running services on hardware and they need to be stopped and restarted, your downtime will be limited to the failover time.
So how do we do this? Let say you have two servers; Frodo and Sam – Frodo is Primary (running services) and Sam is Secondary. In this example we need to update the BIOS and upgrade the RAM of our servers. Follow these steps
- First put the cluster into maintenance mode
- Next power off Sam (the secondary server)
- We can now install any upgrades or hardware we need to
- Power the system up, enter the BIOS and make sure everything is OK
- Reboot and update the BIOS
- Boot Sam into the OS
- At this point you can install any OS updates and reboot again if needed
- Once Sam is back up and everything is verified to be in good condition, bring the cluster out of maintenance mode
- Now migrate services to Sam – again depending on how things are configured this may or may not cause a few seconds of unavailability of services
- Repeat steps 1-4 for Frodo
There you have it, one of the better kept secret benefits of using DRBD.