Monitoring: better safe than sorry…

Stumbling upon the Holy time-travellin’ DRBD, batman! blog post there’s only one thing to be said …

Be strict in what you emit, liberal in what you accept[1. Thanks, Larry]

is simply not true when dealing with mission-critical systems.

It’s ok to be alerted on upgrading a machine because the “old, working” RegEx that did the parsing doesn’t match anymore[1. eg. because /proc/drbd got an additional field]; it’s not a problem to get an email when someone adds the 100th DRBD resource and causes the grep to fail; and so on.

Better to have a few false positives when you’re actively changing things than to get a false negative that costs you months of data; that’s what an assert (and monitoring isn’t that different) is for, after all.

Keep monitoring strict, and let it fail loudly on unexpected things – after the first few occurrences they’re not unexpected anymore and can be dealt with.