Emergency is just another word for incompetency

February 9, 2010

In the system administration world, unexpected events need to be expected.

Your hard drives failed?  The Mean Time Between Failures (MTBF) rating gave you a statistical prediction that this was going to happen; you should have planned for it.
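To make the MTBF point concrete, here is a minimal sketch of the arithmetic. It assumes a constant failure rate (an exponential model) and that failed drives are replaced promptly. The fleet size and MTBF rating are hypothetical numbers chosen for illustration, and the function names are mine, not from any library.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected number of drive failures in one year of continuous operation.

    Assumption: constant failure rate, and failed drives are replaced promptly.
    """
    return drive_count * HOURS_PER_YEAR / mtbf_hours


def probability_of_any_failure(drive_count: int, mtbf_hours: float) -> float:
    """Probability that at least one drive fails within a year (exponential model)."""
    return 1.0 - math.exp(-expected_failures_per_year(drive_count, mtbf_hours))


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives, each rated at 1,000,000 hours MTBF.
    failures = expected_failures_per_year(1000, 1_000_000)
    chance = probability_of_any_failure(1000, 1_000_000)
    print(f"Expected failures per year: {failures:.2f}")
    print(f"Chance of at least one failure: {chance:.4f}")
```

For that hypothetical fleet the expected count works out to about 8.76 failures per year, so a failed drive is a scheduled event in all but name.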

A network partition occurred?  Any network can be built with multiple paths, and a cost-benefit analysis determines how much redundancy is worth paying for to cover internal and vendor failures.  The CAP (Consistency, Availability, Partition tolerance) theorem assists us further: when a partition occurs in a networked environment, we have to choose between consistency, with periods of unavailability, and availability, with eventual consistency.  In either case, a properly planned and implemented solution can provide an acceptable result without creating an emergency.
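The redundancy side of that cost-benefit question can be sketched with a simple availability model. This is an illustration under a strong assumption: paths fail independently, which is often untrue in practice (shared conduits, shared vendors, shared power). The availability figures and function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def combined_availability(path_availability: float, path_count: int) -> float:
    """Availability of path_count parallel paths.

    Assumption: paths fail independently, so the network is down only
    when every path is down at the same time.
    """
    return 1.0 - (1.0 - path_availability) ** path_count


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def paths_needed(path_availability: float, target: float, max_paths: int = 10) -> int:
    """Smallest number of parallel paths that meets the availability target."""
    for count in range(1, max_paths + 1):
        if combined_availability(path_availability, count) >= target:
            return count
    raise ValueError("target not reachable within max_paths")


if __name__ == "__main__":
    # Hypothetical: each path is 99% available; compare one path with two.
    for count in (1, 2):
        availability = combined_availability(0.99, count)
        downtime = expected_downtime_hours(availability)
        print(f"{count} path(s): {downtime:.2f} hours of downtime per year")
    print("Paths needed for 99.9%:", paths_needed(0.99, 0.999))
```

Under these assumptions a single 99% path means roughly 87.6 hours of downtime per year, while two independent paths bring that under one hour. Whether the second path is worth its price is the planning decision, made ahead of time rather than during an outage.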

If fire drills erupt so often that your team spends most of its time responding to them and cannot replace legacy infrastructure fast enough, then there is a systemic problem in place.

The solution to all of these problems is organization, planning, applying expertise to problem domains, having designs and work reviewed by qualified peers, and, above all, caring about the quality of the work performed, its robustness, and comprehensive failure plans, implemented through both automation and human processes.

When you are experiencing frequent emergencies, it is time to look inward at your processes and your approach to your work.  Some things are being done poorly enough to cause these emergencies, and failing to reflect and change will only lead to more of them.