A question we see over and over again is:
Why is umount so slow? Why does it take so long?
Part of the answer was already given in an earlier blog post; here’s some more explanation.
The write() syscall typically writes into RAM only. In Linux we call that “page cache” or “buffer cache”, depending on what exactly the actual target of the write() system call was.
From that RAM (a cache inside the operating system, high in the IO stack) the operating system periodically writes data out, at its leisure, unless it is urged to write out particular pieces (or all of it) right now.
A sync (fsync(), fdatasync(), or …) does exactly that: it urges the operating system to do the write-out.
A umount also causes a write-out of all not-yet-written data of the affected file system.
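To make the difference between "written" and "written out" concrete, here is a minimal Python sketch. The function name write_durably and the file name are made up for illustration; os.write() and os.fsync() are the standard library wrappers around the system calls discussed above. Without the os.fsync() call, the data may sit in the page cache for a while after the function returns.

```python
import os
import tempfile


def write_durably(path: str, payload: bytes) -> int:
    """Write payload to path, then urge the kernel to write it out.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Up to here the data may live only in the page cache.
        # fsync() asks the OS to push this file's data down the IO stack.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        print(write_durably(target, b"some data\n" * 1000))
```

Note that fsync() only covers what the operating system controls; volatile controller or disk caches below it are a separate matter.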
Of course, the “performance” of writes that go into volatile RAM only will be much better than anything that goes to stable, persistent storage. Everything that has only been written to cache but not yet synced (written out to the block layer) will be lost if you have a power outage or server crash.
The Linux block layer has never seen these changes, DRBD has never seen these changes, so they cannot possibly have been replicated anywhere.
Data will be lost.
There are also controller caches which may or may not be volatile, and disk caches, which typically are volatile. These are below and outside the operating system, and not part of this discussion. Just make sure you disable all volatile caches on that level.
Now, for a moment, assume
- you don’t have DRBD in the stack, and
- you have a moderately capable IO backend that writes, say, 300 MByte/s, and
- you have around 3 GiByte of dirty data at the time you trigger the umount, and
- you are not seek-bound, so your backend can actually reach that 300 MB/s,
you get a umount time of around 10 seconds.
Still with me?
Ok. Now, introduce DRBD to your IO stack, and add a long distance replication link. Just for the sake of me trying to explain it here, assume that because it is long distance and you have a limited budget, you can only afford 100 MBit/s. And “long distance” implies larger round trip times, so let’s assume we have an RTT of 100 ms.
Of course that would introduce a single IO request latency of > 100 ms for anything but DRBD protocol A, so you opt for protocol A. (In other words, using protocol A “masks” the RTT of the replication link from the application-visible latency.)
That was latency.
But the limited bandwidth of that replication link also limits your average sustained write throughput, in the given example to about 11 MiByte/s.
The same 3 GiByte of dirty data would now drain much more slowly; in fact, that same umount would now take not 10 seconds, but about 5 minutes.
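The arithmetic behind both numbers is a plain division of dirty data by sustained throughput. Here is a small Python sketch using the figures from this post; the function name drain_seconds is made up for illustration, and the model deliberately ignores seeks, protocol overhead and any new writes arriving during the umount.

```python
GIB = 1024 ** 3   # bytes per GiByte
MIB = 1024 ** 2   # bytes per MiByte
MB = 1000 ** 2    # bytes per MByte


def drain_seconds(dirty_bytes: float, throughput_bytes_per_s: float) -> float:
    """Rough time to write out dirty_bytes at a sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dirty_bytes / throughput_bytes_per_s


dirty = 3 * GIB

# Local backend only: 300 MByte/s.
local = drain_seconds(dirty, 300 * MB)

# Behind the 100 MBit/s link: about 11 MiByte/s sustained, as stated above.
replicated = drain_seconds(dirty, 11 * MIB)

if __name__ == "__main__":
    print(f"local:      {local:.1f} s")
    print(f"replicated: {replicated:.1f} s ({replicated / 60:.1f} min)")
```

With these inputs the local case comes out at a bit under 11 seconds and the replicated case at a bit under 280 seconds, which is where "around 10 seconds" and "about 5 minutes" come from.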
So, concluding: try to avoid having much unsaved data in RAM; it might bite you. For example, you want your cluster to do a switchover, but the umount takes too long and a timeout hits: the node gets (or at least should get) fenced, and the data not written to stable storage will be lost.
Please follow the advice about setting some sysctls to start write-out earlier!
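If you want to see how much unsaved data is currently sitting in RAM, the kernel reports it in the Dirty and Writeback lines of /proc/meminfo (values in kB). Here is a minimal Python sketch for reading them; the function names are made up for illustration, and what counts as "too much" depends on your backend and replication link.

```python
import os


def dirty_kib(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            # Lines look like "Dirty:      123456 kB".
            result[name] = int(rest.split()[0])
    return result


def read_dirty_kib(path: str = "/proc/meminfo") -> dict:
    """Read the current values from the running kernel (Linux only)."""
    with open(path) as f:
        return dirty_kib(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        print(read_dirty_kib())
```

Dividing the Dirty value by your sustained write throughput gives a rough idea of how long a umount would take right now.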